Hello everyone,
I am trying to learn scikit-learn, and my problem is related to the
Stack Overflow question in [1]. When I run the following code:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn import preprocessing

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

lb = preprocessing.LabelBinarizer()
Y = lb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)
for item, labels in zip(X_test, all_labels):
    print '%s => %s' % (item, ', '.join(labels))


I am getting
Traceback (most recent call last):
  File "phrase.py", line 37, in <module>
    Y = lb.fit_transform(y_train_text)
  File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
    return self.fit(X, **fit_params).transform(X)
  File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
    self.y_type_ = type_of_target(y)
  File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
    raise ValueError('You appear to be using a legacy multi-label data'
ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

I tried changing

y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

to

y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
                [1,0], [0,1], [0,1], [0,1], [0,1],
                [0,1], [0,1], [1,1], [1,1]]

but with that change I get:
ValueError: Multioutput target data is not supported with label binarization
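
For reference, the 0/1 rows above use column 0 for "new york" and column 1 for "london". A minimal plain-Python sketch of that conversion follows; the helper name `to_indicator` and the column order are illustrative assumptions, not part of scikit-learn:

```python
def to_indicator(label_lists, classes):
    """Turn a list of label lists into rows of 0/1 flags, one column per class."""
    return [[1 if c in labels else 0 for c in classes] for labels in label_lists]


# Column order is an assumption chosen to match the hand-written rows above.
classes = ["new york", "london"]

# Same 14 training labels as in the script: 6 x new york, 6 x london, 2 x both.
y_train_text = [["new york"]] * 6 + [["london"]] * 6 + [["new york", "london"]] * 2

Y = to_indicator(y_train_text, classes)
```

This only illustrates the binary layout that the first error message asks for; it does not by itself settle which scikit-learn object should consume it.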

Could someone please tell me how to resolve this?

Best regards,
Mukesh Tiwari


[1] http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories