Re: [Scikit-learn-general] (no subject)

Sujit Pal Thu, 10 Dec 2015 09:40:08 -0800

Hi Mukesh,

I got the following error from your code in my environment (Python
2.7.11 - Anaconda 2.4.1, scikit-learn 0.17, Mac OS X 10.9) at this
line:


    Y = lb.fit_transform(y_train_text)
    ValueError: You appear to be using a legacy multi-label data
    representation. Sequence of sequences are no longer supported; use a binary
    array or sparse matrix instead.


To fix, I did this:

    y_train_text0 = [["new york"],["new york"],["new york"],["new york"],
                     ["new york"],["new york"],["london"],["london"],
                     ["london"],["london"],["london"],["london"],
                     ["new york","london"],["new york","london"]]
    # keep only the first label of each sample
    y_train_text = [x[0] for x in y_train_text0]


and a cosmetic fix here:

    for item, labels in zip(X_test, all_labels):
        print '%s => %s' % (item, labels)


and now I get the following result:

    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => new york
    hello welcome to new york. enjoy it here and london too => new york
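
One caveat: taking x[0] keeps only the first label of each sample, so the
last two training sentences lose their "london" label and the problem becomes
single-label. If you want to keep real multi-label targets, here is a minimal
sketch, assuming MultiLabelBinarizer from sklearn.preprocessing (it is present
in scikit-learn 0.17) in place of LabelBinarizer. The variable name mlb is
only illustrative, the data is yours, and I am not showing predictions since
they depend on the fitted model.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = [["new york"], ["new york"], ["new york"], ["new york"],
                ["new york"], ["new york"], ["london"], ["london"],
                ["london"], ["london"], ["london"], ["london"],
                ["new york", "london"], ["new york", "london"]]
X_test = ['nice day in nyc',
          'welcome to london',
          'london is rainy',
          'it is raining in britian',
          'it is raining in britian and the big apple',
          'it is raining in britian and nyc',
          'hello welcome to new york. enjoy it here and london too']

# Turn the lists of labels into a binary indicator matrix,
# one column per label, in the order given by mlb.classes_.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

# Map each predicted indicator row back to a tuple of label names
# (the tuple may be empty if no label was predicted).
all_labels = mlb.inverse_transform(predicted)
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
```

With this, a test sentence can come back with both labels, one label, or none.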


-sujit


On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <[email protected]
> wrote:

> Hello Everyone,
> I am trying to learn scikit-learn, and my problem is related to the
> question in [1]. When I run the following code
>
> import numpy as np
> from sklearn.pipeline import Pipeline
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.svm import LinearSVC
> from sklearn.feature_extraction.text import TfidfTransformer
> from sklearn.multiclass import OneVsRestClassifier
> from sklearn import preprocessing
>
> X_train = np.array(["new york is a hell of a town",
>                     "new york was originally dutch",
>                     "the big apple is great",
>                     "new york is also called the big apple",
>                     "nyc is nice",
>                     "people abbreviate new york city as nyc",
>                     "the capital of great britain is london",
>                     "london is in the uk",
>                     "london is in england",
>                     "london is in great britain",
>                     "it rains a lot in london",
>                     "london hosts the british museum",
>                     "new york is great and so is london",
>                     "i like london better than new york"])
> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["new york"],["london"],["london"],
>                 ["london"],["london"],["london"],["london"],
>                 ["new york","london"],["new york","london"]]
>
> X_test = np.array(['nice day in nyc',
>                    'welcome to london',
>                    'london is rainy',
>                    'it is raining in britian',
>                    'it is raining in britian and the big apple',
>                    'it is raining in britian and nyc',
>                    'hello welcome to new york. enjoy it here and london too'])
> target_names = ['New York', 'London']
>
> lb = preprocessing.LabelBinarizer()
> Y = lb.fit_transform(y_train_text)
>
> classifier = Pipeline([
>     ('vectorizer', CountVectorizer()),
>     ('tfidf', TfidfTransformer()),
>     ('clf', OneVsRestClassifier(LinearSVC()))])
>
> classifier.fit(X_train, Y)
> predicted = classifier.predict(X_test)
> all_labels = lb.inverse_transform(predicted)
> for item, labels in zip(X_test, all_labels):
>     print '%s => %s' % (item, ', '.join(labels))
>
>
> I am getting
> Traceback (most recent call last):
>   File "phrase.py", line 37, in <module>
>     Y = lb.fit_transform(y_train_text)
>   File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
>     return self.fit(X, **fit_params).transform(X)
>   File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
>     self.y_type_ = type_of_target(y)
>   File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
>     raise ValueError('You appear to be using a legacy multi-label data'
> ValueError: You appear to be using a legacy multi-label data representation.
> Sequence of sequences are no longer supported; use a binary array or sparse
> matrix instead.
>
> I tried to change
>
> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["new york"],["london"],["london"],
>                 ["london"],["london"],["london"],["london"],
>                 ["new york","london"],["new york","london"]]
>
> to
>
> y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>                 [1,0], [0,1], [0,1], [0,1], [0,1],
>                 [0,1], [0,1], [1,1], [1,1]]
>
> but then I get
> ValueError: Multioutput target data is not supported with label binarization
>
> Could someone please tell me how to resolve this?
>
> Best regards,
> Mukesh Tiwari
>
>
> [1]
> http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
