Hi Mukesh,

I got the following error from your code in my environment (Python 2.7.11 - Anaconda 2.4.1, scikit-learn 0.17, Mac OS X 10.9) on this line:

    Y = lb.fit_transform(y_train_text)

    ValueError: You appear to be using a legacy multi-label data
    representation. Sequence of sequences are no longer supported; use a binary
    array or sparse matrix instead.

To fix it, I did this:

    y_train_text0 = [["new york"],["new york"],["new york"],["new york"],["new york"],
                     ["new york"],["london"],["london"],["london"],["london"],
                     ["london"],["london"],["new york","london"],["new york","london"]]
    y_train_text = [x[0] for x in y_train_text0]

and made a cosmetic fix here:

    for item, labels in zip(X_test, all_labels):
        print '%s => %s' % (item, labels)

I now get the following result:

    nice day in nyc => new york
    welcome to london => london
    london is rainy => london
    it is raining in britian => london
    it is raining in britian and the big apple => new york
    it is raining in britian and nyc => new york
    hello welcome to new york. enjoy it here and london too => new york

-sujit

On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <mukeshtiwari.ii...@gmail.com> wrote:

> Hello Everyone,
> I am trying to learn scikit and my problem is somewhat related to this
> problem [1]. When I try to run the code
>
> import numpy as np
> from sklearn.pipeline import Pipeline
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.svm import LinearSVC
> from sklearn.feature_extraction.text import TfidfTransformer
> from sklearn.multiclass import OneVsRestClassifier
> from sklearn import preprocessing
>
> X_train = np.array(["new york is a hell of a town",
>                     "new york was originally dutch",
>                     "the big apple is great",
>                     "new york is also called the big apple",
>                     "nyc is nice",
>                     "people abbreviate new york city as nyc",
>                     "the capital of great britain is london",
>                     "london is in the uk",
>                     "london is in england",
>                     "london is in great britain",
>                     "it rains a lot in london",
>                     "london hosts the british museum",
>                     "new york is great and so is london",
>                     "i like london better than new york"])
> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["london"],["london"],["london"],["london"],
>                 ["london"],["london"],["new york","london"],["new york","london"]]
>
> X_test = np.array(['nice day in nyc',
>                    'welcome to london',
>                    'london is rainy',
>                    'it is raining in britian',
>                    'it is raining in britian and the big apple',
>                    'it is raining in britian and nyc',
>                    'hello welcome to new york. enjoy it here and london too'])
> target_names = ['New York', 'London']
>
> lb = preprocessing.LabelBinarizer()
> Y = lb.fit_transform(y_train_text)
>
> classifier = Pipeline([
>     ('vectorizer', CountVectorizer()),
>     ('tfidf', TfidfTransformer()),
>     ('clf', OneVsRestClassifier(LinearSVC()))])
>
> classifier.fit(X_train, Y)
> predicted = classifier.predict(X_test)
> all_labels = lb.inverse_transform(predicted)
> for item, labels in zip(X_test, all_labels):
>     print '%s => %s' % (item, ', '.join(labels))
>
> I am getting
>
> Traceback (most recent call last):
>   File "phrase.py", line 37, in <module>
>     Y = lb.fit_transform(y_train_text)
>   File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
>     return self.fit(X, **fit_params).transform(X)
>   File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
>     self.y_type_ = type_of_target(y)
>   File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
>     raise ValueError('You appear to be using a legacy multi-label data'
> ValueError: You appear to be using a legacy multi-label data representation.
> Sequence of sequences are no longer supported; use a binary array or sparse
> matrix instead.
>
> I tried to change
>
> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>                 ["new york"],["london"],["london"],["london"],["london"],
>                 ["london"],["london"],["new york","london"],["new york","london"]]
>
> to
>
> y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>                 [1,0], [0,1], [0,1], [0,1], [0,1],
>                 [0,1], [0,1], [1,1], [1,1]]
>
> and then I get
>
> ValueError: Multioutput target data is not supported with label binarization
>
> Could someone please tell me how to resolve this?
>
> Best regards,
> Mukesh Tiwari
>
> [1] http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
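
A note on the "binary array" the error message asks for: taking x[0] keeps only the first label of each sample, so the last two training rows become plain "new york" and the problem turns into single-label classification, which is likely why the final test sentence comes back with one city only. The sketch below is a minimal pure-Python illustration of the multi-label indicator matrix instead; the helper names binarize_labels and labels_from_rows are made up for this example and are not scikit-learn functions. scikit-learn ships preprocessing.MultiLabelBinarizer for this conversion, and my understanding is that such a matrix should be passed straight to classifier.fit rather than through LabelBinarizer, which appears to be what triggered the "Multioutput target data" error.

```python
# Illustrative sketch only: hypothetical helpers, not scikit-learn's implementation.

def binarize_labels(label_lists):
    """Return (classes, matrix); matrix[i][j] is 1 if sample i carries classes[j]."""
    # Sorted unique labels define the column order of the indicator matrix.
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1
        matrix.append(row)
    return classes, matrix


def labels_from_rows(classes, matrix):
    """Inverse mapping: turn each 0/1 row back into a list of label names."""
    return [[classes[j] for j, flag in enumerate(row) if flag] for row in matrix]


# Same labels as in the original post, written compactly.
y_train_text = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)

classes, Y = binarize_labels(y_train_text)
print(classes)   # column order of the matrix
print(Y[12])     # a row for a sample that carries both labels
```

With this representation the two-city training rows keep both labels, and labels_from_rows plays the role that inverse_transform plays in the original script.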