Hi Mukesh,
I think you are looking for *multi-label classifiers* where a record can be
of multiple classes. According to this page:
http://scikit-learn.org/stable/modules/multiclass.html
The following classifiers support multilabel: Decision Tree, Random
Forest, Nearest Neighbor and Ridge Regression.
By changing the binarizer to MultiLabelBinarizer, and the LinearSVC
reference to one of the supported classifiers, I was able to get this to run
to completion. The predict(X) method returns only a single class, so I used
predict_proba(X) to get a vector of probabilities for each class. You
probably need some sort of cutoff to determine if something is in a class
or not. My changes are as follows. Replacing the binarizer:

    #lb = preprocessing.LabelBinarizer()
    lb = preprocessing.MultiLabelBinarizer()
    Y = lb.fit_transform(y_train_text)

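
To see what the binarizer change does, here is a minimal sketch on three made-up records (a shortened version of the training labels, written with print() so it runs on Python 2.7 and 3). MultiLabelBinarizer takes a list of label lists and returns a 0/1 indicator matrix with one column per label; lb.classes_ reports the column order.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each record carries a list of labels; the last record has two.
y_train_text = [["new york"], ["london"], ["new york", "london"]]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

# classes_ holds the label names in column order (sorted alphabetically).
classes = [str(c) for c in lb.classes_]
rows = Y.tolist()

print(classes)  # column order: london, new york
print(rows)     # one row per record, one 0/1 flag per label
```

Note that the columns follow the sorted label names, not the order in which the labels first appear in the data.
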
Replacing the classifier with one of the supported ones in the pipeline:

    # extra imports needed for the alternative classifiers
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.neighbors import KNeighborsClassifier

    classifier = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsRestClassifier(RandomForestClassifier()))])
        # ('clf', OneVsRestClassifier(KNeighborsClassifier()))])
        # ('clf', OneVsRestClassifier(LinearSVC()))])

Finally, replacing the call to predict(X_test) with predict_proba(X_test):

    classifier.fit(X_train, Y)
    #predicted = classifier.predict(X_test)
    predicted = classifier.predict_proba(X_test)
    #all_labels = lb.inverse_transform(predicted)

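
For comparison, a minimal sketch of calling predict() when Y is a multi-label indicator matrix. As far as I can tell, OneVsRestClassifier then returns a 0/1 indicator matrix with one column per label rather than a single class per row, so lb.inverse_transform can map the rows straight back to label names. This reuses the training data from the original post with LinearSVC; random_state=0 and the three test sentences are choices made for this sketch, and the labels it predicts on so little data are not shown here.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
y_train_text = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)
X_test = ["nice day in nyc",
          "welcome to london",
          "hello welcome to new york. enjoy it here and london too"]

# Lists of labels -> 0/1 indicator matrix, one column per label.
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(random_state=0)))])
classifier.fit(X_train, Y)

# With an indicator matrix as target, predict() returns an indicator
# matrix too: each row may have zero, one or several columns set.
indicator = classifier.predict(X_test)
all_labels = lb.inverse_transform(indicator)
for item, labels in zip(X_test, all_labels):
    print('%s => %s' % (item, ', '.join(labels)))
```
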
I just printed out the predicted matrix and this is what I get with
KNeighborsClassifier and RandomForestClassifier.

KNeighborsClassifier:

    [[ 0.6  0.4]
     [ 1.   0. ]
     [ 1.   0.2]
     [ 1.   0.2]
     [ 0.6  0.6]
     [ 0.8  0.4]
     [ 0.6  0.8]]

RandomForestClassifier:

    [[ 0.3  0.3]
     [ 0.9  0.3]
     [ 1.   0.4]
     [ 0.7  0.2]
     [ 0.4  0.3]
     [ 0.4  0.2]
     [ 0.5  0.5]]

If you threshold at 0.5 you will get reasonable results with
KNeighborsClassifier, though not as accurate as hoped. Maybe it needs more
input or some experimentation with hyperparameters. Something like this:

    #for item, labels in zip(X_test, predicted):
    #    print '%s => %s' % (item, ', '.join(str(labels)))
    for item, preds in zip(X_test, predicted):
        # 0/1 flag per label using the 0.5 cutoff
        norm_preds = [(0 if x < 0.5 else 1) for x in preds.tolist()]
        # look up the name for each flagged column
        pred_targets = ["" if x[1] == 0 else target_names[x[0]]
                        for x in enumerate(norm_preds)]
        print item, filter(lambda x: len(x.strip()) > 0, pred_targets)

returns these results:

    nice day in nyc ['New York']
    welcome to london ['New York']
    london is rainy ['New York']
    it is raining in britian ['New York']
    it is raining in britian and the big apple ['New York', 'London']
    it is raining in britian and nyc ['New York']
    hello welcome to new york. enjoy it here and london too ['New York', 'London']

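
A shorter way to apply the 0.5 cutoff, as a sketch: threshold the whole matrix with NumPy and let lb.inverse_transform turn each row back into label names. The probabilities below are the KNeighborsClassifier values printed earlier, copied by hand, and the binarizer is refitted on three shortened label lists. One thing that may be worth checking against your own run: MultiLabelBinarizer sorts its classes, so for these labels lb.classes_ is london, new york. That makes column 0 London, which is the reverse of target_names = ['New York', 'London']. If that was the column order in the run above, the names in the printed results are swapped (the second row, 'welcome to london', would then come out as London).

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# Same label vocabulary as in the thread, shortened to three records.
lb = MultiLabelBinarizer()
lb.fit([["new york"], ["london"], ["new york", "london"]])

# KNeighborsClassifier probabilities from the message above, copied by hand.
predicted = np.array([[0.6, 0.4], [1.0, 0.0], [1.0, 0.2], [1.0, 0.2],
                      [0.6, 0.6], [0.8, 0.4], [0.6, 0.8]])

# Same rule as the loop above: below 0.5 -> 0, otherwise 1.
indicator = (predicted >= 0.5).astype(int)

# inverse_transform uses lb.classes_ for the column-to-name mapping,
# so the label order never has to be hard-coded.
labels = lb.inverse_transform(indicator)
for row in labels:
    print(', '.join(row))
```

The 0.5 cutoff is only the value suggested in this message; a different threshold per label may work better on real data.
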
-sujit
On Thu, Dec 10, 2015 at 10:29 PM, mukesh tiwari <
[email protected]> wrote:
> Dear Sujit,
> Thank you for the reply and the solution. It's working great, but using this
> I can determine only one label at a time. The last line
> "hello welcome to new york. enjoy it here and london too" should output
> "london, new york" but it's only giving "new york".
>
> I am trying to do sentiment analysis of hotel reviews based on 6 aspects
> (Restaurant, Frontdesk, Room Amenities & Experience, Washrooms, Hotel
> and Internet), and these categories have sub-categories (almost 90). I am
> tagging each review with the sentiment about each sub-category. A review
> can be tagged with multiple sub-categories, so my requirement is
> multi-label. Examples are given below.
>
> The location was excellent for this hotel as it's super close to the
> airport and the wifi connection was relatively okay but those were the only
> perks => *Positive Location, Positive Wifi*
>
> The room was filthy, we had to call reception twice to ask for toilet
> paper as we didn't have any and there were stains on the walls, the toilet
> seat, balls of hair on the floor, need I carry on => *Negative Room,
> Negative Walls, Negative Toilet seat*
>
> The picture of the "breakfast buffet" says it all really. => *Negative
> Breakfast*
>
> All in all we won't be coming back no matter how close it is. =>
> *Negative Experience*
>
> I am building my term matrix simply by term frequency, inverse document
> frequency. In short, I have an n x m matrix (n samples and m features):
>
> >>> data
> array([[1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
> 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
> [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
> 0, 1, 0, 1, 0, 1, 1, 0, 2, 1, 1, 0, 0],
> [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
> 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]])
>
> and the output is
> >>> y
> [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
>
> and now I need a classifier for this purpose.
>
>
> Best regards,
> Mukesh Tiwari
>
>
> On Thu, Dec 10, 2015 at 11:08 PM, Sujit Pal <[email protected]>
> wrote:
>
>> Hi Mukesh,
>>
>> I was getting the following error from your code on my environment
>> (Python 2.7.11 - Anaconda 2.4.1, scikit-learn 0.17) on Mac OSX 10.9 for the
>> following line:
>>
>>     Y = lb.fit_transform(y_train_text)
>>     ValueError: You appear to be using a legacy multi-label data
>>     representation. Sequence of sequences are no longer supported; use a binary
>>     array or sparse matrix instead.
>>
>>
>> To fix, I did this:
>>
>>     y_train_text0 = [["new york"],["new york"],["new york"],["new york"],
>>                      ["new york"],["new york"],["london"],["london"],
>>                      ["london"],["london"],["london"],["london"],
>>                      ["new york","london"],["new york","london"]]
>>     y_train_text = [x[0] for x in y_train_text0]
>>
>>
>> and a cosmetic fix here:
>>
>>     for item, labels in zip(X_test, all_labels):
>>         print '%s => %s' % (item, labels)
>>
>>
>> and now I get the following result:
>>
>>     nice day in nyc => new york
>>     welcome to london => london
>>     london is rainy => london
>>     it is raining in britian => london
>>     it is raining in britian and the big apple => new york
>>     it is raining in britian and nyc => new york
>>     hello welcome to new york. enjoy it here and london too => new york
>>
>>
>> -sujit
>>
>>
>> On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <
>> [email protected]> wrote:
>>
>>> Hello Everyone,
>>> I am trying to learn scikit and my problem is somewhat related to this
>>> problem [1]. When I am trying to run the code
>>>
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.svm import LinearSVC
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn import preprocessing
>>>
>>> X_train = np.array(["new york is a hell of a town",
>>> "new york was originally dutch",
>>> "the big apple is great",
>>> "new york is also called the big apple",
>>> "nyc is nice",
>>> "people abbreviate new york city as nyc",
>>> "the capital of great britain is london",
>>> "london is in the uk",
>>> "london is in england",
>>> "london is in great britain",
>>> "it rains a lot in london",
>>> "london hosts the british museum",
>>> "new york is great and so is london",
>>> "i like london better than new york"])
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["new york"],["london"],["london"],
>>>                 ["london"],["london"],["london"],["london"],
>>>                 ["new york","london"],["new york","london"]]
>>>
>>> X_test = np.array(['nice day in nyc',
>>> 'welcome to london',
>>> 'london is rainy',
>>> 'it is raining in britian',
>>> 'it is raining in britian and the big apple',
>>> 'it is raining in britian and nyc',
>>> 'hello welcome to new york. enjoy it here and london too'])
>>> target_names = ['New York', 'London']
>>>
>>> lb = preprocessing.LabelBinarizer()
>>> Y = lb.fit_transform(y_train_text)
>>>
>>> classifier = Pipeline([
>>>     ('vectorizer', CountVectorizer()),
>>>     ('tfidf', TfidfTransformer()),
>>>     ('clf', OneVsRestClassifier(LinearSVC()))])
>>>
>>> classifier.fit(X_train, Y)
>>> predicted = classifier.predict(X_test)
>>> all_labels = lb.inverse_transform(predicted)
>>> for item, labels in zip(X_test, all_labels):
>>>     print '%s => %s' % (item, ', '.join(labels))
>>>
>>>
>>> I am getting
>>> Traceback (most recent call last):
>>> File "phrase.py", line 37, in <module>
>>> Y = lb.fit_transform(y_train_text)
>>> File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in
>>> fit_transform
>>> return self.fit(X, **fit_params).transform(X)
>>> File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py",
>>> line 300, in fit
>>> self.y_type_ = type_of_target(y)
>>> File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line
>>> 251, in type_of_target
>>> raise ValueError('You appear to be using a legacy multi-label data'
>>> ValueError: You appear to be using a legacy multi-label data
>>> representation. Sequence of sequences are no longer supported; use a binary
>>> array or sparse matrix instead.
>>>
>>> I tried to change the
>>>
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["new york"],["london"],["london"],
>>>                 ["london"],["london"],["london"],["london"],
>>>                 ["new york","london"],["new york","london"]]
>>>
>>> to y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>>> [1,0], [0,1], [0,1], [0,1], [0,1],
>>> [0,1], [0,1], [1,1], [1,1]]
>>>
>>> then I am getting
>>> ValueError: Multioutput target data is not supported with label binarization
>>>
>>> Could someone please tell me how to resolve this?
>>>
>>> Best regards,
>>> Mukesh Tiwari
>>>
>>>
>>> [1]
>>> http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>