Re: [Scikit-learn-general] (no subject)

Sujit Pal Fri, 11 Dec 2015 09:58:59 -0800

Hi Mukesh,

I think you are looking for *multi-label classifiers* where a record can be
of multiple classes. According to this page:
http://scikit-learn.org/stable/modules/multiclass.html


> The following classifiers support multilabel - Decision Tree, Random
> Forest, Nearest Neighbor and Ridge Regression.
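
As a rough illustration of what "supports multilabel" means in practice, the sketch below fits a KNeighborsClassifier directly on a binary indicator matrix (one column per label), with no OneVsRestClassifier wrapper. The toy arrays are made up for illustration and do not come from this thread, and the code assumes Python 3 with a recent scikit-learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data (hypothetical): 4 samples, 2 features, 2 independent labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
# Each row is a binary indicator vector; a row may contain several 1s.
Y = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# With n_neighbors=1 every query simply copies its nearest training row.
clf = KNeighborsClassifier(n_neighbors=1)
clf.fit(X, Y)

pred = clf.predict(np.array([[0.9, 0.1], [0.1, 0.9]]))
print(pred)
```

The same indicator-matrix shape for the targets should also work with the tree-based classifiers named above.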


By changing the binarizer to MultiLabelBinarizer, and the LinearSVC
reference to one of the supported classifiers, I was able to get this to run
to completion. The predict(X) method returned only a single class, so I used
predict_proba(X) to get a vector of probabilities for each class. You
probably need some sort of cutoff to determine whether something is in a
class or not. My changes are as follows. Replacing the binarizer:

> #lb = preprocessing.LabelBinarizer()
> lb = preprocessing.MultiLabelBinarizer()
> Y = lb.fit_transform(y_train_text)
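
To make the binarizer step concrete, here is a minimal sketch (Python 3, recent scikit-learn assumed) of what MultiLabelBinarizer produces for a list of label lists. The three-row label list is a shortened stand-in for the thread's y_train_text.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Shortened stand-in for y_train_text: one list of labels per sample.
y_text = [["new york"], ["london"], ["new york", "london"]]

lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_text)

# classes_ holds the column order of Y; it is sorted, not input order.
print(lb.classes_)
print(Y)

# inverse_transform maps indicator rows back to tuples of label names.
print(lb.inverse_transform(Y))
```

One thing worth checking: because the columns follow the sorted order in lb.classes_, a hand-written list such as target_names = ['New York', 'London'] may not line up with the columns of Y. Looking names up through lb.classes_ avoids that ambiguity.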


Replacing the classifier in the pipeline with one of the supported ones:

> classifier = Pipeline([
>     ('vectorizer', CountVectorizer()),
>     ('tfidf', TfidfTransformer()),
>     ('clf', OneVsRestClassifier(RandomForestClassifier()))])
> #    ('clf', OneVsRestClassifier(KNeighborsClassifier()))])
> #    ('clf', OneVsRestClassifier(LinearSVC()))])


Finally, replacing the call to predict(X_test) with predict_proba(X_test):

> classifier.fit(X_train, Y)
> #predicted = classifier.predict(X_test)
> predicted = classifier.predict_proba(X_test)
> #all_labels = lb.inverse_transform(predicted)
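
Putting the three changes together, a self-contained version might look like the sketch below. It is written for Python 3 and a recent scikit-learn (the snippets in this thread are Python 2 with scikit-learn 0.17), it uses the training sentences from the original post with a shortened test set, and it picks KNeighborsClassifier only because that choice is deterministic. The printed scores are not guaranteed to match the numbers shown in this message.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

X_train = [
    "new york is a hell of a town",
    "new york was originally dutch",
    "the big apple is great",
    "new york is also called the big apple",
    "nyc is nice",
    "people abbreviate new york city as nyc",
    "the capital of great britain is london",
    "london is in the uk",
    "london is in england",
    "london is in great britain",
    "it rains a lot in london",
    "london hosts the british museum",
    "new york is great and so is london",
    "i like london better than new york",
]
y_train_text = (
    [["new york"]] * 6 + [["london"]] * 6 + [["new york", "london"]] * 2
)
X_test = ["nice day in nyc", "welcome to london"]

# Turn the label lists into a binary indicator matrix.
lb = MultiLabelBinarizer()
Y = lb.fit_transform(y_train_text)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(KNeighborsClassifier())),
])
classifier.fit(X_train, Y)

# One row per test sentence, one column per entry of lb.classes_.
proba = classifier.predict_proba(X_test)
for sentence, row in zip(X_test, proba):
    print(sentence, dict(zip(lb.classes_, row)))
```

In the multilabel case each column is scored by its own binary classifier, so a row need not sum to 1.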


I just printed out the predicted matrix and this is what I get with
KNeighborsClassifier and RandomForestClassifier.

KNeighborsClassifier:
> [[ 0.6  0.4]
>  [ 1.   0. ]
>  [ 1.   0.2]
>  [ 1.   0.2]
>  [ 0.6  0.6]
>  [ 0.8  0.4]
>  [ 0.6  0.8]]

RandomForestClassifier:
> [[ 0.3  0.3]
>  [ 0.9  0.3]
>  [ 1.   0.4]
>  [ 0.7  0.2]
>  [ 0.4  0.3]
>  [ 0.4  0.2]
>  [ 0.5  0.5]]


If you threshold at 0.5 you will get reasonable results with
KNeighborsClassifier, though not as accurate as hoped. Maybe it needs more
input or some experimentation with hyperparameters. Something like this:

> #for item, labels in zip(X_test, predicted):
> #    print '%s => %s' % (item, ', '.join(str(labels)))
> for item, preds in zip(X_test, predicted):
>     norm_preds = [(0 if x < 0.5 else 1) for x in preds.tolist()]
>     pred_targets = ["" if x[1] == 0 else target_names[x[0]]
>                     for x in enumerate(norm_preds)]
>     print item, filter(lambda x: len(x.strip()) > 0, pred_targets)


returns these results:

> nice day in nyc ['New York']
> welcome to london ['New York']
> london is rainy ['New York']
> it is raining in britian ['New York']
> it is raining in britian and the big apple ['New York', 'London']
> it is raining in britian and nyc ['New York']
> hello welcome to new york. enjoy it here and london too ['New York', 'London']
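
The thresholding step can also be written as a small helper. The sketch below is plain Python 3 with no dependencies; the function name labels_above_threshold and the placeholder column names are hypothetical, and the score matrix is the KNeighborsClassifier output printed earlier in this message. It keeps the same rule as the loop above: a score below 0.5 counts as "not in the class".

```python
def labels_above_threshold(probabilities, class_names, threshold=0.5):
    """For each row, return the class names whose score is >= threshold."""
    results = []
    for row in probabilities:
        picked = [name for name, p in zip(class_names, row) if p >= threshold]
        results.append(picked)
    return results


# KNeighborsClassifier scores as printed earlier in this message.
knn_scores = [
    [0.6, 0.4],
    [1.0, 0.0],
    [1.0, 0.2],
    [1.0, 0.2],
    [0.6, 0.6],
    [0.8, 0.4],
    [0.6, 0.8],
]

# Placeholder names (assumption): in real code take these from lb.classes_,
# which gives the actual column order of the score matrix.
column_names = ["label_0", "label_1"]

for picked in labels_above_threshold(knn_scores, column_names):
    print(picked)
```

Passing lb.classes_ as class_names keeps the printed names tied to the columns the binarizer actually produced, and the threshold argument makes it easy to try cutoffs other than 0.5.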


-sujit

On Thu, Dec 10, 2015 at 10:29 PM, mukesh tiwari <
mukeshtiwari.ii...@gmail.com> wrote:

> Dear Sujit,
> Thank you for the reply and the solution. It's working great, but with this
> I can determine only one label at a time. The last line
> "hello welcome to new york. enjoy it here and london too" should output
> "london, new york", but it's only giving "new york".
>
> I am trying to do sentiment analysis of hotel reviews based on 6 aspects
> (Restaurant, Frontdesk, Room Amenities & Experience, Washrooms, Hotel
> and Internet), and all of these categories have subcategories (almost 90).
> I am tagging each review with the sentiment about each subcategory. I can
> tag a review with multiple subcategories, so my requirement is multi-label.
> An example is given below.
>
> The location was excellent for this hotel as it's super close to the
> airport and the wifi connection was relatively okay but those were the only
> perks => * Positive Location, Positive Wifi*
>
> The room was filthy, we had to call reception twice to ask for toilet
> paper as we didn't have any and there were stains on the walls, the toilet
> seat, balls of hair on the floor, need I carry on => *Negative Room,
> Negative Walls, Negative Toilet seat *
>
> The picture of the "breakfast buffet" says it all really. => *Negative
> Breakfast*
>
> All in all we won't be coming back no matter how close it is. =>
> *Negative Experience*
>
> I am building my term matrix simply by term frequency, inverse document
> frequency. In short, I have an n x m matrix (n samples and m features):
>
> >>> data
> array([[1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
>         0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
>        [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
>         0, 1, 0, 1, 0, 1, 1, 0, 2, 1, 1, 0, 0],
>        [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
>         1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]])
>
> and the output is
> >>> y
> [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
>
> and now I need a classifier for this purpose.
>
>
> Best regards,
> Mukesh Tiwari
>
>
> On Thu, Dec 10, 2015 at 11:08 PM, Sujit Pal <sujitatgt...@gmail.com>
> wrote:
>
>> Hi Mukesh,
>>
>> I was getting the following error from your code on my environment
>> (Python 2.7.11 - Anaconda 2.4.1, scikit-learn 0.17) on Mac OSX 10.9 for the
>> following line:
>>
>>> Y = lb.fit_transform(y_train_text)
>>> ValueError: You appear to be using a legacy multi-label data
>>> representation. Sequence of sequences are no longer supported; use a binary
>>> array or sparse matrix instead.
>>
>>
>> To fix, I did this:
>>
>>> y_train_text0 = [["new york"],["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["london"],["london"],["london"],["london"],
>>>                 ["london"],["london"],["new york","london"],["new york","london"]]
>>> y_train_text = [x[0] for x in y_train_text0]
>>
>>
>> and a cosmetic fix here:
>>
>>> for item, labels in zip(X_test, all_labels):
>>>     print '%s => %s' % (item, labels)
>>
>>
>> and am now getting the following result:
>>
>>> nice day in nyc => new york
>>> welcome to london => london
>>> london is rainy => london
>>> it is raining in britian => london
>>> it is raining in britian and the big apple => new york
>>> it is raining in britian and nyc => new york
>>> hello welcome to new york. enjoy it here and london too => new york
>>
>>
>> -sujit
>>
>>
>> On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <
>> mukeshtiwari.ii...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>> I am trying to learn scikit and my problem is somewhat related to this
>>> problem [1]. When I am trying to run the code
>>>
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.svm import LinearSVC
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn import preprocessing
>>>
>>> X_train = np.array(["new york is a hell of a town",
>>>                     "new york was originally dutch",
>>>                     "the big apple is great",
>>>                     "new york is also called the big apple",
>>>                     "nyc is nice",
>>>                     "people abbreviate new york city as nyc",
>>>                     "the capital of great britain is london",
>>>                     "london is in the uk",
>>>                     "london is in england",
>>>                     "london is in great britain",
>>>                     "it rains a lot in london",
>>>                     "london hosts the british museum",
>>>                     "new york is great and so is london",
>>>                     "i like london better than new york"])
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["london"],["london"],["london"],["london"],
>>>                 ["london"],["london"],["new york","london"],["new york","london"]]
>>>
>>> X_test = np.array(['nice day in nyc',
>>>                    'welcome to london',
>>>                    'london is rainy',
>>>                    'it is raining in britian',
>>>                    'it is raining in britian and the big apple',
>>>                    'it is raining in britian and nyc',
>>>                    'hello welcome to new york. enjoy it here and london too'])
>>> target_names = ['New York', 'London']
>>>
>>> lb = preprocessing.LabelBinarizer()
>>> Y = lb.fit_transform(y_train_text)
>>>
>>> classifier = Pipeline([
>>>     ('vectorizer', CountVectorizer()),
>>>     ('tfidf', TfidfTransformer()),
>>>     ('clf', OneVsRestClassifier(LinearSVC()))])
>>>
>>> classifier.fit(X_train, Y)
>>> predicted = classifier.predict(X_test)
>>> all_labels = lb.inverse_transform(predicted)
>>> for item, labels in zip(X_test, all_labels):
>>>     print '%s => %s' % (item, ', '.join(labels))
>>>
>>>
>>> I am getting
>>> Traceback (most recent call last):
>>>   File "phrase.py", line 37, in <module>
>>>     Y = lb.fit_transform(y_train_text)
>>>   File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
>>>     return self.fit(X, **fit_params).transform(X)
>>>   File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
>>>     self.y_type_ = type_of_target(y)
>>>   File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
>>>     raise ValueError('You appear to be using a legacy multi-label data'
>>> ValueError: You appear to be using a legacy multi-label data
>>> representation. Sequence of sequences are no longer supported; use a binary
>>> array or sparse matrix instead.
>>>
>>> I tried to change the
>>>
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["london"],["london"],["london"],["london"],
>>>                 ["london"],["london"],["new york","london"],["new york","london"]]
>>>
>>> to y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>>>                    [1,0], [0,1], [0,1], [0,1], [0,1],
>>>                    [0,1], [0,1], [1,1], [1,1]]
>>>
>>> then I am getting
>>> ValueError: Multioutput target data is not supported with label binarization
>>>
>>> Could someone please tell me how to resolve this?
>>>
>>> Best regards,
>>> Mukesh Tiwari
>>>
>>>
>>> [1]
>>> http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>>
