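I think this caveat has been added to the dev docs (it is not yet in the
stable docs). You may want to read:
https://scikit-learn.org/dev/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
and in particular the part that starts with "A common mistake is to pass
in a list". MultiLabelBinarizer expects each sample to be an iterable of
labels, so if a sample is a plain string it gets split into characters.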
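
To make that concrete, here is a minimal sketch using only the standard
library. It assumes (I have not checked the CSV) that each entry in your
labels column is a string that merely looks like a list, e.g.
"['stat.ML', 'cs.LG']". The names raw_labels and parsed_labels are made up
for illustration, and collecting a sorted set is only a stand-in for what
you would see in classes_, not scikit-learn's actual code.

```python
import ast
from itertools import chain

# Assumed shape of the raw column: every row is a *string*, not a list.
raw_labels = ["['stat.ML', 'cs.LG']", "['cs.CV', 'cs.RO']"]

# Iterating over a string yields single characters, so treating each
# sample as an iterable of labels produces one "class" per character.
char_classes = sorted(set(chain.from_iterable(raw_labels)))

# Parse each string into a real Python list of labels first.
parsed_labels = [ast.literal_eval(s) for s in raw_labels]
label_classes = sorted(set(chain.from_iterable(parsed_labels)))

print(char_classes)   # single characters such as '[', "'", 'c', 's', ...
print(label_classes)  # ['cs.CV', 'cs.LG', 'cs.RO', 'stat.ML']
```

If that assumption holds for your data, something like
arxiv_data['labels'].apply(ast.literal_eval) should give you a column of
lists that you can pass to mlb.fit directly, without stripping quotes.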

Cheers,
Loïc

> Hi.
>
> I am working on a Multi-label text classification problem. In order to encode 
> the labels, I am using MultiLabelBinarizer. The labels of the dataset look 
> like -
>
> image
>
> When I am using
>
> from sklearn.preprocessing import MultiLabelBinarizer
>
> mlb = MultiLabelBinarizer()
> mlb.fit(labels)
> print(mlb.classes_)
>
> I am getting -
>
> image
>
> Whereas, the output (sample output) I want is -
>
> image
>
> I got the above output by -
>
> mlb = MultiLabelBinarizer()
> sample_labels = [
>     ['stat.ML', 'cs.LG'],
>     ['cs.CV', 'cs.RO']
> ]
> mlb.fit(sample_labels)
> print(mlb.classes_)
>
> Help would be very much appreciated here.
>
> Here's the dataset I had prepared:
> arXivdata.csv.zip
>
> I stripped away the double quotes in the labels after loading it in a pandas 
> DataFrame by -
>
> arxiv_data['labels'] = arxiv_data['labels'].str.replace('"', '', regex=False)
>
> scikit-learn version: '0.21.3'
>
> Sayak Paul | sayak.dev

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
