Hello everyone,

I am trying to learn scikit-learn, and my problem is related to this question [1]. When I run the following code:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn import preprocessing

    X_train = np.array(["new york is a hell of a town",
                        "new york was originally dutch",
                        "the big apple is great",
                        "new york is also called the big apple",
                        "nyc is nice",
                        "people abbreviate new york city as nyc",
                        "the capital of great britain is london",
                        "london is in the uk",
                        "london is in england",
                        "london is in great britain",
                        "it rains a lot in london",
                        "london hosts the british museum",
                        "new york is great and so is london",
                        "i like london better than new york"])
    y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","london"],["new york","london"]]

    X_test = np.array(['nice day in nyc',
                       'welcome to london',
                       'london is rainy',
                       'it is raining in britian',
                       'it is raining in britian and the big apple',
                       'it is raining in britian and nyc',
                       'hello welcome to new york. enjoy it here and london too'])
    target_names = ['New York', 'London']

    lb = preprocessing.LabelBinarizer()
    Y = lb.fit_transform(y_train_text)

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(LinearSVC()))])

    classifier.fit(X_train, Y)
    predicted = classifier.predict(X_test)
    all_labels = lb.inverse_transform(predicted)

    for item, labels in zip(X_test, all_labels):
        print '%s => %s' % (item, ', '.join(labels))

I get the following error:

    Traceback (most recent call last):
      File "phrase.py", line 37, in <module>
        Y = lb.fit_transform(y_train_text)
      File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
        return self.fit(X, **fit_params).transform(X)
      File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
        self.y_type_ = type_of_target(y)
      File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
        raise ValueError('You appear to be using a legacy multi-label data'
    ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

I then tried changing

    y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                    ["new york"],["london"],["london"],["london"],["london"],
                    ["london"],["london"],["new york","london"],["new york","london"]]

to

    y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0], [1,0], [0,1],
                    [0,1], [0,1], [0,1], [0,1], [0,1], [1,1], [1,1]]

but then I get:

    ValueError: Multioutput target data is not supported with label binarization

Could someone please tell me how to resolve this?

Best regards,
Mukesh Tiwari

[1] http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
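
For reference, here is a minimal sketch of one possible way around both errors. It is not a confirmed fix for the exact setup above. It assumes a scikit-learn version that provides `MultiLabelBinarizer` in `sklearn.preprocessing`, and it uses Python 3 style `print`. The idea is to convert the list of label lists into a binary indicator matrix, which is the representation the first error message asks for, and then train on that matrix.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
# Assumption: this class is available in the installed scikit-learn version.
from sklearn.preprocessing import MultiLabelBinarizer

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"]] * 6 + [["london"]] * 6 + [["new york", "london"]] * 2

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])

# Turn the list of label lists into a (n_samples, n_classes) 0/1 matrix.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

# Map each predicted 0/1 row back to a tuple of label names.
all_labels = mlb.inverse_transform(predicted)

for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
```

The hand-written `[[1,0], ...]` lists from the second attempt are already in indicator form, so it should also be possible to skip the binarizer and pass `np.array(y_train_text)` directly to `classifier.fit`. The predictions would then come back as 0/1 rows instead of label names.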