I think this caveat has been added to the dev doc (it is not yet in the stable doc). You may want to read https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html, in particular the part that starts with "A common mistake is to pass in a list".
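To illustrate what that caveat is about, here is a minimal sketch. I am assuming (I cannot see the attached images) that each row of the labels column is a single string such as "['stat.ML', 'cs.LG']" rather than a list; the `raw_labels` values below are made up for illustration and are not taken from your CSV. A string is itself an iterable of characters, so MultiLabelBinarizer treats every character as a label. Parsing each string into a real list first (here with `ast.literal_eval`, which is one possible way) gives the classes you expect.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows: each entry is ONE string that merely looks like a list
# (an assumption about how the labels column loads from the CSV).
raw_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Mistake: each string is iterated character by character,
# so the learned "classes" are single characters.
char_classes = list(MultiLabelBinarizer().fit(raw_labels).classes_)

# Possible fix: turn every string into an actual list of labels first.
parsed_labels = [ast.literal_eval(row) for row in raw_labels]

mlb = MultiLabelBinarizer()
mlb.fit(parsed_labels)
classes = list(mlb.classes_)

print(char_classes)  # single characters such as "'", ",", "." ...
print(classes)       # ['cs.CV', 'cs.LG', 'cs.RO', 'stat.ML']
```

If the column lives in a pandas DataFrame, the same parsing step could be applied with something like `arxiv_data['labels'].apply(ast.literal_eval)` before calling `fit`, provided the strings are valid Python list literals.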
Cheers,
Loïc

> Hi.
>
> I am working on a Multi-label text classification problem. In order to encode
> the labels, I am using MultiLabelBinarizer. The labels of the dataset look
> like -
>
> image
>
> When I am using
>
> mlb = MultiLabelBinarizer()
> mlb.fit(labels)
> print(mlb.classes_)
>
> I am getting -
>
> image
>
> Whereas, the output (sample output) I want is -
>
> image
>
> I got the above output by -
>
> mlb = MultiLabelBinarizer()
> sample_labels = [
>     ['stat.ML', 'cs.LG'],
>     ['cs.CV', 'cs.RO']
> ]
> mlb.fit(sample_labels)
> print(mlb.classes_)
>
> Help would be very much appreciated here.
>
> Here's the dataset I had prepared:
> arXivdata.csv.zip
>
> I stripped away the double quotes in the labels after loading it in a pandas
> DataFrame by -
>
> import re
>
> arxiv_data['labels'] = arxiv_data['labels'].str.replace(r"[\"]", '')
>
> scikit-learn version: '0.21.3'
>
> Sayak Paul | sayak.dev

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn