Hi Mukesh,

I think you are looking for *multi-label classifiers*, where a record can belong to multiple classes. According to this page:
http://scikit-learn.org/stable/modules/multiclass.html

The following classifiers support multilabel - Decision Tree, Random Forest, Nearest Neighbor and Ridge Regression.

By changing the binarizer to MultiLabelBinarizer, and the LinearSVC reference to one of the supported classifiers, I was able to get this to run to completion. In my run the predict(X) method gave me only a single class, so I used predict_proba(X) to get a vector of probabilities for each class. You probably need some sort of cutoff to determine if something is in a class or not.

My changes are as follows. Replacing the binarizer:

> #lb = preprocessing.LabelBinarizer()
> lb = preprocessing.MultiLabelBinarizer()
> Y = lb.fit_transform(y_train_text)

Replacing the classifier with one of the supported ones in the pipeline (with the extra imports they need):

> from sklearn.ensemble import RandomForestClassifier
> from sklearn.neighbors import KNeighborsClassifier
>
> classifier = Pipeline([
>     ('vectorizer', CountVectorizer()),
>     ('tfidf', TfidfTransformer()),
>     ('clf', OneVsRestClassifier(RandomForestClassifier()))])
> #    ('clf', OneVsRestClassifier(KNeighborsClassifier()))])
> #    ('clf', OneVsRestClassifier(LinearSVC()))])

Finally, replacing the call to predict(X_test) with predict_proba(X_test):

> classifier.fit(X_train, Y)
> #predicted = classifier.predict(X_test)
> predicted = classifier.predict_proba(X_test)
> #all_labels = lb.inverse_transform(predicted)

I just printed out the predicted matrix, and this is what I get with KNeighborsClassifier and RandomForestClassifier.

> KNeighborsClassifier:
> [[ 0.6  0.4]
>  [ 1.   0. ]
>  [ 1.   0.2]
>  [ 1.   0.2]
>  [ 0.6  0.6]
>  [ 0.8  0.4]
>  [ 0.6  0.8]]
>
> RandomForestClassifier:
> [[ 0.3  0.3]
>  [ 0.9  0.3]
>  [ 1.   0.4]
>  [ 0.7  0.2]
>  [ 0.4  0.3]
>  [ 0.4  0.2]
>  [ 0.5  0.5]]

If you threshold at 0.5 you will get reasonable results with KNeighborsClassifier, though not as accurate as hoped. Maybe it needs more input or some experimentation with hyperparameters.
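
A minimal sketch of that 0.5 cutoff in plain Python, using the KNeighborsClassifier matrix printed above. The helper name labels_above_cutoff and the class_names list are illustrative and not part of the original script. One assumption to flag: MultiLabelBinarizer keeps its column order in lb.classes_, sorted, so if it was fit on the lowercase labels from the original script the columns would be london first, then new york. It may be worth checking that order against any hand-written target_names list.

```python
# Probabilities reported above for KNeighborsClassifier (one row per test item).
knn_probs = [
    [0.6, 0.4],
    [1.0, 0.0],
    [1.0, 0.2],
    [1.0, 0.2],
    [0.6, 0.6],
    [0.8, 0.4],
    [0.6, 0.8],
]

# ASSUMPTION: column order. With a fitted MultiLabelBinarizer this would be
# read from lb.classes_ (sorted labels) instead of being typed in by hand.
class_names = ["london", "new york"]


def labels_above_cutoff(probs, names, cutoff=0.5):
    """For each row, return the names whose probability is >= cutoff."""
    return [
        [name for name, p in zip(names, row) if p >= cutoff]
        for row in probs
    ]


predicted_labels = labels_above_cutoff(knn_probs, class_names)
for labels in predicted_labels:
    print(labels)
```

In the real pipeline, class_names would come from lb.classes_ so that each column is paired with the label the binarizer actually assigned to it.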

Thresholding at 0.5 looks something like this:

> #for item, labels in zip(X_test, predicted):
> #    print '%s => %s' % (item, ', '.join(str(labels)))
> for item, preds in zip(X_test, predicted):
>     norm_preds = [(0 if x < 0.5 else 1) for x in preds.tolist()]
>     pred_targets = ["" if x[1] == 0 else target_names[x[0]]
>                     for x in enumerate(norm_preds)]
>     print item, filter(lambda x: len(x.strip()) > 0, pred_targets)

which returns these results:

> nice day in nyc ['New York']
> welcome to london ['New York']
> london is rainy ['New York']
> it is raining in britian ['New York']
> it is raining in britian and the big apple ['New York', 'London']
> it is raining in britian and nyc ['New York']
> hello welcome to new york. enjoy it here and london too ['New York', 'London']

-sujit

On Thu, Dec 10, 2015 at 10:29 PM, mukesh tiwari <mukeshtiwari.ii...@gmail.com> wrote:

> Dear Sujit,
> Thank you for the reply and the solution. It's working great, but using this I can determine only one feature at a time. The last line "hello welcome to new york. enjoy it here and london too" should output "london, new york" but it's only giving "new york".
>
> I am trying to do sentiment analysis of hotel reviews based on 6 aspects like Restaurant, Frontdesk, Room Amenities & Experience, Washrooms, Hotel and Internet, and each of these categories has subcategories (almost 90). I am tagging each review with the sentiment about the subcategory. I can tag a review with multiple subcategories, so my requirement is multi-label. Examples are given below.
>
> The location was excellent for this hotel as it's super close to the airport and the wifi connection was relatively okay but those were the only perks => *Positive Location, Positive Wifi*
>
> The room was filthy, we had to call reception twice to ask for toilet paper as we didn't have any and there were stains on the walls, the toilet seat, balls of hair on the floor, need I carry on => *Negative Room, Negative Walls, Negative Toilet seat*
>
> The picture of the "breakfast buffet" says it all really. => *Negative Breakfast*
>
> All in all we won't be coming back no matter how close it is. => *Negative Experience*
>
> I am building my term matrix simply by term frequency, inverse document frequency. In short, I have an n x m matrix (n samples and m features)
>
> >>> data
> array([[1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
>         0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
>        [0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0,
>         0, 1, 0, 1, 0, 1, 1, 0, 2, 1, 1, 0, 0],
>        [0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,
>         1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]])
>
> and the output is
> >>> y
> [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
>
> and now I need a classifier for this purpose.
>
> Best regards,
> Mukesh Tiwari
>
> On Thu, Dec 10, 2015 at 11:08 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
>
>> Hi Mukesh,
>>
>> I was getting the following error from your code on my environment (Python 2.7.11 - Anaconda 2.4.1, scikit-learn 0.17) on Mac OSX 10.9 for the following line:
>>
>> > Y = lb.fit_transform(y_train_text)
>> > ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
>>
>> To fix, I did this:
>>
>> > y_train_text0 = [["new york"],["new york"],["new york"],["new york"],["new york"],
>> >                  ["new york"],["london"],["london"],["london"],["london"],
>> >                  ["london"],["london"],["new york","london"],["new york","london"]]
>> > y_train_text = [x[0] for x in y_train_text0]
>>
>> and a cosmetic fix here:
>>
>> > for item, labels in zip(X_test, all_labels):
>> >     print '%s => %s' % (item, labels)
>>
>> and I am now getting the following result:
>>
>> > nice day in nyc => new york
>> > welcome to london => london
>> > london is rainy => london
>> > it is raining in britian => london
>> > it is raining in britian and the big apple => new york
>> > it is raining in britian and nyc => new york
>> > hello welcome to new york. enjoy it here and london too => new york
>>
>> -sujit
>>
>> On Thu, Dec 10, 2015 at 4:08 AM, mukesh tiwari <mukeshtiwari.ii...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>> I am trying to learn scikit and my problem is somewhat related to this problem [1].
>>> When I am trying to run the code
>>>
>>> import numpy as np
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.svm import LinearSVC
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.multiclass import OneVsRestClassifier
>>> from sklearn import preprocessing
>>>
>>> X_train = np.array(["new york is a hell of a town",
>>>                     "new york was originally dutch",
>>>                     "the big apple is great",
>>>                     "new york is also called the big apple",
>>>                     "nyc is nice",
>>>                     "people abbreviate new york city as nyc",
>>>                     "the capital of great britain is london",
>>>                     "london is in the uk",
>>>                     "london is in england",
>>>                     "london is in great britain",
>>>                     "it rains a lot in london",
>>>                     "london hosts the british museum",
>>>                     "new york is great and so is london",
>>>                     "i like london better than new york"])
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["london"],["london"],["london"],["london"],
>>>                 ["london"],["london"],["new york","london"],["new york","london"]]
>>>
>>> X_test = np.array(['nice day in nyc',
>>>                    'welcome to london',
>>>                    'london is rainy',
>>>                    'it is raining in britian',
>>>                    'it is raining in britian and the big apple',
>>>                    'it is raining in britian and nyc',
>>>                    'hello welcome to new york. enjoy it here and london too'])
>>> target_names = ['New York', 'London']
>>>
>>> lb = preprocessing.LabelBinarizer()
>>> Y = lb.fit_transform(y_train_text)
>>>
>>> classifier = Pipeline([
>>>     ('vectorizer', CountVectorizer()),
>>>     ('tfidf', TfidfTransformer()),
>>>     ('clf', OneVsRestClassifier(LinearSVC()))])
>>>
>>> classifier.fit(X_train, Y)
>>> predicted = classifier.predict(X_test)
>>> all_labels = lb.inverse_transform(predicted)
>>> for item, labels in zip(X_test, all_labels):
>>>     print '%s => %s' % (item, ', '.join(labels))
>>>
>>> I am getting
>>>
>>> Traceback (most recent call last):
>>>   File "phrase.py", line 37, in <module>
>>>     Y = lb.fit_transform(y_train_text)
>>>   File "/Library/Python/2.7/site-packages/sklearn/base.py", line 455, in fit_transform
>>>     return self.fit(X, **fit_params).transform(X)
>>>   File "/Library/Python/2.7/site-packages/sklearn/preprocessing/label.py", line 300, in fit
>>>     self.y_type_ = type_of_target(y)
>>>   File "/Library/Python/2.7/site-packages/sklearn/utils/multiclass.py", line 251, in type_of_target
>>>     raise ValueError('You appear to be using a legacy multi-label data'
>>> ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.
>>>
>>> I tried to change
>>>
>>> y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
>>>                 ["new york"],["london"],["london"],["london"],["london"],
>>>                 ["london"],["london"],["new york","london"],["new york","london"]]
>>>
>>> to
>>>
>>> y_train_text = [[1,0], [1,0], [1,0], [1,0], [1,0],
>>>                 [1,0], [0,1], [0,1], [0,1], [0,1],
>>>                 [0,1], [0,1], [1,1], [1,1]]
>>>
>>> and then I am getting
>>>
>>> ValueError: Multioutput target data is not supported with label binarization
>>>
>>> Could someone please tell me how to resolve this?
>>>
>>> Best regards,
>>> Mukesh Tiwari
>>>
>>> [1] http://stackoverflow.com/questions/10526579/use-scikit-learn-to-classify-into-multiple-categories
>>>
>>> ------------------------------------------------------------------------------
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general