Hi Mukesh,
I got the following error from your code in my environment (Python
2.7.11, Anaconda 2.4.1, scikit-learn 0.17, Mac OS X 10.9) at this line:
Y = lb.fit_transform(y_train_text)
> ValueError: You appear to be using a legacy multi-label data
> representation. Sequence of sequences are no longer supported; use a binary
> array or sparse matrix instead.
To fix, I did this:
y_train_text0 = [["new york"],["new york"],["new york"],["new york"],["new york"],
                 ["new york"],["london"],["london"],["london"],["london"],
                 ["london"],["london"],["new york","london"],["new york","london"]]
y_train_text = [x[0] for x in y_train_text0]
I also changed the print loop, since each label is now a single string rather than a list:
for item, labels in zip(X_test, all_labels):
    print '%s => %s' % (item, labels)
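The ', '.join(labels) had to go because each entry of all_labels is now a plain string, and joining a string joins its characters. A small illustration (the variable names are mine, just for the example):

```python
# Joining a list of labels versus joining a single label string.
labels_as_list = ["new york", "london"]
label_as_str = "new york"

# A list of labels is joined label by label.
joined_list = ", ".join(labels_as_list)

# A single string is iterated character by character, so every
# character (including the space) ends up separated by ", ".
joined_str = ", ".join(label_as_str)

print(joined_list)
print(joined_str)
```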
With those changes I now get the following result:
nice day in nyc => new york
welcome to london => london
london is rainy => london
it is raining in britian => london
it is raining in britian and the big apple => new york
it is raining in britian and nyc => new york
hello welcome to new york. enjoy it here and london too => new york
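One caveat: taking x[0] keeps only the first label of each row, so the last two training rows lose "london" and this becomes ordinary single-label classification. That is why every test sentence gets exactly one city. If you want to keep both labels, I believe MultiLabelBinarizer (in sklearn.preprocessing since 0.15) is the intended replacement for the old sequence-of-sequences input. Below is a minimal sketch of that approach, reusing your data and pipeline. I have not checked what it predicts for each test sentence, and print is written with parentheses so it should run on both Python 2 and 3:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]

# Original multi-label targets: one list of labels per training sentence.
y_train_text = [["new york"], ["new york"], ["new york"], ["new york"],
                ["new york"], ["new york"], ["london"], ["london"],
                ["london"], ["london"], ["london"], ["london"],
                ["new york", "london"], ["new york", "london"]]

X_test = ["nice day in nyc",
          "welcome to london",
          "london is rainy",
          "it is raining in britian",
          "it is raining in britian and the big apple",
          "it is raining in britian and nyc",
          "hello welcome to new york. enjoy it here and london too"]

# MultiLabelBinarizer turns the lists of labels into a binary indicator
# matrix with one column per class (columns follow sorted class names).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

# inverse_transform maps each indicator row back to a tuple of labels,
# which may hold zero, one, or both city names.
all_labels = mlb.inverse_transform(predicted)
for item, labels in zip(X_test, all_labels):
    print("%s => %s" % (item, ", ".join(labels)))
```

Here the ', '.join(labels) from the original script works again, because each prediction is a tuple of labels rather than a single string.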
-sujit
On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <[email protected]> wrote:
> Hello Everyone,
> I am trying to learn scikit-learn, and my problem is related to the
> question in [1]. When I try to run this code:
>
> import numpy as np
> from sklearn.pipeline import Pipeline
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.svm import LinearSVC
> from sklearn.feature_extraction.text import TfidfTransformer
> from sklearn.multiclass import OneVsRestClassifier
> from sklearn import preprocessing
>
> X_train = np.array(["new york is a hell of a town",
> "new york was originally dutch",
> "the big apple is great",
> "new york is also called the big apple",
> "nyc is nice",
> "people abbreviate new york city as nyc",
> "the capital of great britain is london",
> "london is in the uk",
> "london is in england",
> "london is in great britain",
> "it rains a lot in london",
> "london hosts the british museum",
> "new york is great and so is london",
> "i like london better than new york"])
> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["london"],["london"],["london"],["london"],
>                 ["london"],["london"],["new york","london"],["new york","london"]]
>
> X_test = np.array(['nice day in nyc',
> 'welcome to london',
> 'london is rainy',
> 'it is raining in britian',
> 'it is raining in britian and the big apple',
> 'it is raining in britian and nyc',
> 'hello welcome to new york. enjoy it here and london too'])
> target_names = ['New York', 'London']
>
> lb = preprocessing.LabelBinarizer()
> Y = lb.fit_transform(y_train_text)
>
> classifier = Pipeline([
> ('vectorizer', CountVectorizer()),
> ('tfidf', TfidfTransformer()),
> ('clf', OneVsRestClassifier(LinearSVC()))])
>
> classifier.fit(X_train, Y)
> predicted = classifier.predict(X_test)
> all_labels = lb.inverse_transform(predicted)
> for item, labels in zip(X_test, all_labels):
>     print '%s => %s' % (item, ', '.join(labels))
>
>
> I am getting
> Traceback (most recent call last):
> File "phrase.py", line 37, in <module>
> Y = lb.fit_transform(y_train_text)
> File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in
> fit_transform
> return self.fit(X, **fit_params).transform(X)
> File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line
> 300, in fit
> self.y_type_ = type_of_target(y)
> File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line
> 251, in type_of_target
> raise ValueError('You appear to be using a legacy multi-label data'
> ValueError: You appear to be using a legacy multi-label data representation.
> Sequence of sequences are no longer supported; use a binary array or sparse
> matrix instead.
>
> I tried to change the
>
> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["london"],["london"],["london"],["london"],
>                 ["london"],["london"],["new york","london"],["new york","london"]]
>
> to y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
> [1,0], [0,1], [0,1], [0,1], [0,1],
> [0,1], [0,1], [1,1], [1,1]]
>
> then I am getting
> ValueError: Multioutput target data is not supported with label binarization
>
> Could someone please tell me how to resolve this?
>
> Best regards,
> Mukesh Tiwari
>
>
> [1]
> http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>
>
> ------------------------------------------------------------------------------
>
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>