Dear Sujit,
Thank you for your reply and the solution. It works well, but with it I can
predict only one label per sample. The last line,
"hello welcome to new york. enjoy it here and london too", should output
"london, new york", but it gives only "new york".
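
One possible way to get several labels per sample, sketched here under assumptions rather than as a confirmed fix, is to keep the label lists intact and encode them with MultiLabelBinarizer instead of LabelBinarizer. The sketch reuses the training and test sentences from the quoted code below; it assumes scikit-learn 0.15 or later and Python 3 syntax for print. Whether a given test sentence actually receives both labels depends on the fitted model.

```python
# Sketch only: assumes scikit-learn >= 0.15 (MultiLabelBinarizer) and Python 3.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

X_train = ["new york is a hell of a town",
           "new york was originally dutch",
           "the big apple is great",
           "new york is also called the big apple",
           "nyc is nice",
           "people abbreviate new york city as nyc",
           "the capital of great britain is london",
           "london is in the uk",
           "london is in england",
           "london is in great britain",
           "it rains a lot in london",
           "london hosts the british museum",
           "new york is great and so is london",
           "i like london better than new york"]
# One list of labels per training sentence; the last two carry both labels.
y_train_text = ([["new york"]] * 6 + [["london"]] * 6
                + [["new york", "london"]] * 2)

X_test = ["nice day in nyc",
          "welcome to london",
          "london is rainy",
          "it is raining in britian",
          "it is raining in britian and the big apple",
          "it is raining in britian and nyc",
          "hello welcome to new york. enjoy it here and london too"]

# Turn the label lists into a binary indicator matrix (one column per label).
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_train_text)

classifier = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC()))])

classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)

# Map each predicted indicator row back to a tuple of label names.
all_labels = mlb.inverse_transform(predicted)
for item, labels in zip(X_test, all_labels):
    print("%s => %s" % (item, ", ".join(labels)))
```

Because OneVsRestClassifier fits one binary classifier per label column, a sentence can come back with zero, one, or both labels.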

I am trying to do sentiment analysis of hotel reviews based on 6 aspects:
Restaurant, Frontdesk, Room Amenities & Experience, Washrooms, Hotel,
and Internet. These categories have sub-categories (almost 90 in total). I am
tagging each review with the sentiment about each sub-category it mentions.
A review can be tagged with multiple sub-categories, so my requirement is
multi-label. Examples are given below.

The location was excellent for this hotel as it's super close to the
airport and the wifi connection was relatively okay but those were the only
perks => *Positive Location, Positive Wifi*

The room was filthy, we had to call reception twice to ask for toilet paper
as we didn't have any and there were stains on the walls, the toilet seat,
balls of hair on the floor, need I carry on => *Negative Room, Negative
Walls, Negative Toilet seat*

The picture of the "breakfast buffet" says it all really. => *Negative
Breakfast*

All in all we won't be coming back no matter how close it is. => *Negative
Experience*

I am building my term matrix simply from term frequency-inverse document
frequency (TF-IDF). In short, I have an n x m matrix (n samples and m features):

>>> data
array([[1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
        0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
        0, 1, 0, 1, 0, 1, 1, 0, 2, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
        1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]])

and the output is
>>> y
[[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]

and now I need a classifier for this purpose.
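
A minimal sketch of one option, assuming the feature matrix and the binary label matrix are already built as above: OneVsRestClassifier accepts a binary indicator matrix directly as the target and fits one binary classifier per label column. The toy arrays below are hypothetical stand-ins (6 samples, 4 features, 3 labels), not the real review data, and LinearSVC is just one possible base estimator. Each column could stand for a tag such as "Positive Location" or "Negative Room".

```python
# Sketch only: hypothetical toy data, not the hotel-review matrices.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# n x m term matrix: 6 samples, 4 term features.
X = np.array([[2, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 0, 2],
              [1, 1, 1, 1],
              [2, 2, 0, 0]])

# n x k binary indicator matrix: a 1 in column j means label j applies.
Y = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 1],
              [1, 1, 0],
              [1, 1, 0]])

# One LinearSVC is trained per label column.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, Y)

# predict returns an indicator matrix with the same number of columns as Y.
pred = clf.predict(X)
print(pred)
```

With around 90 sub-categories and two sentiments each, this approach would train one binary classifier per label column.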


Best regards,
Mukesh Tiwari


On Thu, Dec 10, 2015 at 11:08 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:

> Hi Mukesh,
>
> I was getting the following error from your code on my environment (Python
> 2.7.11 - Anaconda 2.4.1, scikit-learn 0.17) on Mac OSX 10.9 for the
> following line:
>
>     Y = lb.fit_transform(y_train_text)
>> ValueError: You appear to be using a legacy multi-label data
>> representation. Sequence of sequences are no longer supported; use a binary
>> array or sparse matrix instead.
>
>
> To fix, I did this:
>
> y_train_text0 = [["new york"],["new york"],["new york"],["new york"],
>>                 ["new york"],["new york"],["london"],["london"],["london"],
>>                 ["london"],["london"],["london"],["new york","london"],
>>                 ["new york","london"]]
>> y_train_text = [x[0] for x in y_train_text0]
>
>
> and a cosmetic fix here:
>
> for item, labels in zip(X_test, all_labels):
>>     print '%s => %s' % (item, labels)
>
>
> and now I get the following result:
>
> nice day in nyc => new york
>> welcome to london => london
>> london is rainy => london
>> it is raining in britian => london
>> it is raining in britian and the big apple => new york
>> it is raining in britian and nyc => new york
>> hello welcome to new york. enjoy it here and london too => new york
>
>
> -sujit
>
>
> On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <
> mukeshtiwari.ii...@gmail.com> wrote:
>
>> Hello Everyone,
>> I am trying to learn scikit-learn, and my problem is related to this
>> question [1]. When I try to run the code
>>
>> import numpy as np
>> from sklearn.pipeline import Pipeline
>> from sklearn.feature_extraction.text import CountVectorizer
>> from sklearn.svm import LinearSVC
>> from sklearn.feature_extraction.text import TfidfTransformer
>> from sklearn.multiclass import OneVsRestClassifier
>> from sklearn import preprocessing
>>
>> X_train = np.array(["new york is a hell of a town",
>>                     "new york was originally dutch",
>>                     "the big apple is great",
>>                     "new york is also called the big apple",
>>                     "nyc is nice",
>>                     "people abbreviate new york city as nyc",
>>                     "the capital of great britain is london",
>>                     "london is in the uk",
>>                     "london is in england",
>>                     "london is in great britain",
>>                     "it rains a lot in london",
>>                     "london hosts the british museum",
>>                     "new york is great and so is london",
>>                     "i like london better than new york"])
>> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>>                 ["new york"],["new york"],["london"],["london"],["london"],
>>                 ["london"],["london"],["london"],["new york","london"],
>>                 ["new york","london"]]
>>
>> X_test = np.array(['nice day in nyc',
>>                    'welcome to london',
>>                    'london is rainy',
>>                    'it is raining in britian',
>>                    'it is raining in britian and the big apple',
>>                    'it is raining in britian and nyc',
>>                    'hello welcome to new york. enjoy it here and london too'])
>> target_names = ['New York', 'London']
>>
>> lb = preprocessing.LabelBinarizer()
>> Y = lb.fit_transform(y_train_text)
>>
>> classifier = Pipeline([
>>     ('vectorizer', CountVectorizer()),
>>     ('tfidf', TfidfTransformer()),
>>     ('clf', OneVsRestClassifier(LinearSVC()))])
>>
>> classifier.fit(X_train, Y)
>> predicted = classifier.predict(X_test)
>> all_labels = lb.inverse_transform(predicted)
>> for item, labels in zip(X_test, all_labels):
>>     print '%s => %s' % (item, ', '.join(labels))
>>
>>
>> I am getting
>> Traceback (most recent call last):
>>   File "phrase.py", line 37, in <module>
>>     Y = lb.fit_transform(y_train_text)
>>   File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
>>     return self.fit(X, **fit_params).transform(X)
>>   File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
>>     self.y_type_ = type_of_target(y)
>>   File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
>>     raise ValueError('You appear to be using a legacy multi-label data'
>> ValueError: You appear to be using a legacy multi-label data representation.
>> Sequence of sequences are no longer supported; use a binary array or sparse
>> matrix instead.
>>
>> I tried to change the
>>
>> y_train_text = [["new york"],["new york"],["new york"],["new york"],
>>                 ["new york"],["new york"],["london"],["london"],["london"],
>>                 ["london"],["london"],["london"],["new york","london"],
>>                 ["new york","london"]]
>>
>> to y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>>                    [1,0], [0,1], [0,1], [0,1], [0,1],
>>                    [0,1], [0,1], [1,1], [1,1]]
>>
>> then I am getting
>> ValueError: Multioutput target data is not supported with label binarization
>>
>> Could someone please tell me how to resolve this?
>>
>> Best regards,
>> Mukesh Tiwari
>>
>>
>> [1]
>> http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>>
>>
>> ------------------------------------------------------------------------------
>>
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>>