Hi. I am working on a multi-label text classification problem. To encode the labels, I am using MultiLabelBinarizer. The labels of the dataset look like this:
[image: image] <https://user-images.githubusercontent.com/22957388/64753547-42b10a00-d541-11e9-80b2-f0a9245df327.png>

When I run

    mlb = MultiLabelBinarizer()
    mlb.fit(labels)
    print(mlb.classes_)

I get:

[image: image] <https://user-images.githubusercontent.com/22957388/64753625-78ee8980-d541-11e9-8833-a17769f1bf47.png>

Whereas the output (sample output) I want is:

[image: image] <https://user-images.githubusercontent.com/22957388/64753641-89066900-d541-11e9-98fb-fb9f9e1e7305.png>

I got the above output with:

    mlb = MultiLabelBinarizer()
    sample_labels = [
        ['stat.ML', 'cs.LG'],
        ['cs.CV', 'cs.RO']
    ]
    mlb.fit(sample_labels)
    print(mlb.classes_)

Help would be very much appreciated here.

Here's the dataset I had prepared: arXivdata.csv.zip <https://github.com/scikit-learn/scikit-learn/files/3603687/arXivdata.csv.zip>

I stripped away the double quotes in the labels after loading the file into a pandas DataFrame:

    import re
    arxiv_data['labels'] = arxiv_data['labels'].str.replace(r"[\"]", '')

scikit-learn version: '0.21.3'

Sayak Paul | sayak.dev
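P.S. One possible cause, which I have not verified against the actual file: if each entry of the labels column is still a string such as "['stat.ML', 'cs.LG']" rather than a Python list, MultiLabelBinarizer would iterate over it character by character. The sketch below uses made-up rows (raw_labels is a hypothetical stand-in for my column, not the real data) and ast.literal_eval to turn each string into a list before fitting.

```python
import ast

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical rows standing in for the 'labels' column: each entry is a
# string that merely looks like a list (an assumption about the CSV).
raw_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Fitting on the raw strings: each string is an iterable of characters,
# so every distinct character becomes its own class.
mlb_chars = MultiLabelBinarizer()
mlb_chars.fit(raw_labels)
char_classes = list(mlb_chars.classes_)

# Parse each string into a real Python list, then fit again.
parsed_labels = [ast.literal_eval(s) for s in raw_labels]
mlb = MultiLabelBinarizer()
mlb.fit(parsed_labels)
label_classes = list(mlb.classes_)

print(char_classes)
print(label_classes)
```

If the real column does hold strings like these, the same parsing step could probably be applied with arxiv_data['labels'].apply(ast.literal_eval) before calling mlb.fit.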
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn