2013/9/12 Ark <4rk....@gmail.com>:
>
> Upon moving to the machine learning approach, we realised the issue
> is misclassification or even duplication of categories in some cases.
> In order to get a rough estimate I was thinking of a clustering approach
> like k-means, however since the number of categories might be less than
> 3000, does this seem to be the correct approach? Or if there is a better
> solution, would certainly appreciate pointers.
There is no guarantee that the categories found by k-means will match the categories you are really interested in. For instance, consider a corpus of comments on a sports website covering 3 sports:
- football,
- tennis,
- cricket.

Assume you are interested in categorizing posts according to 3 sentiment categories:
- positive,
- neutral,
- negative.

If the positive and negative words are more frequent and discriminative than the sport-specific words, then k-means clustering with k=3 might help get you started. However it's very likely that the sport-specific words will be more discriminative and k-means will rather base its clustering on those.

In real life it's even more likely that there will be correlations between sentiments and sports (and even with other non-sport / non-sentiment topics), for instance if:
- tennis comments are more neutral on average than football comments,
- football is more negative than tennis and cricket,
- cricket is always positive.

In that case k-means with k=3 will learn clusters that mix and match sentiment-specific words with sport-specific words. Such spurious correlations are very likely to happen if your corpus has some categories with very few example documents. (The first sketch in the PS at the end of this message shows a quick way to check what the clusters actually picked up on.)

What you could do on the other hand is:
- get rid of all the categories where you have fewer than 50 documents (or start labeling new documents specifically for those categories, using a keyword search with a blend of very specific keywords for each category),
- train a first model on your existing corpus and then start an active learning loop: run the classifier on a batch of unlabeled documents and manually label the new documents where the classifier is the least certain (you can choose a classifier with a decision_function or a predict_proba method to find the least confident predictions; see the second sketch in the PS),
- it might help to add a "garbage" category for all the documents that don't match any of the categories you are really interested in. It might be required to help a multiclass classifier pick up less corpus-specific bias. You can initialize this category with random text from Wikipedia or the web.

If you plan to seriously increase the number of documents in your corpus you could also try a Rocchio classifier [1] or a k-NN classifier. For large text document collections it's probably more interesting to implement them on top of a search engine such as Solr or Elasticsearch with similarity queries (the last sketch in the PS shows a small in-memory scikit-learn variant to get started). Here is some toy code I wrote for a k-NN classifier on Wikipedia articles using Solr:

https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py

[1] http://en.wikipedia.org/wiki/Rocchio_algorithm

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
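
PS: a few rough, untested sketches to make the suggestions above more concrete. Variable names such as texts, labeled_texts, labels and unlabeled_texts are placeholders for your own data, not anything from your setup.

First, a quick way to sanity-check what a k=3 k-means clustering on tf-idf features actually latched onto: inspect the highest-weighted terms of each centroid. If the top terms are sport names rather than sentiment words, the clusters are tracking the wrong axis.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # texts: a list of raw documents (placeholder, replace with your own data)
    vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english")
    X = vectorizer.fit_transform(texts)

    km = KMeans(n_clusters=3, random_state=0)
    km.fit(X)

    # Print the 10 highest-weighted terms of each centroid to see which
    # axis (sport vs sentiment) the clustering actually picked up on.
    terms = vectorizer.get_feature_names_out()  # get_feature_names() on old versions
    for i, centroid in enumerate(km.cluster_centers_):
        top = centroid.argsort()[::-1][:10]
        print("cluster %d: %s" % (i, ", ".join(terms[j] for j in top)))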
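
Second, the uncertainty-sampling step of the active learning loop: train on the labeled corpus, score a batch of unlabeled documents with predict_proba, and surface the least confident predictions for manual labeling. Any classifier with predict_proba (or decision_function) would do; logistic regression is just one choice.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # labeled_texts, labels, unlabeled_texts: placeholders for your own data
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
    X_train = vectorizer.fit_transform(labeled_texts)
    clf = LogisticRegression().fit(X_train, labels)

    # predict_proba returns one probability per category; the lower the
    # maximum probability, the less confident the classifier is about
    # that document.
    X_unlabeled = vectorizer.transform(unlabeled_texts)
    confidence = clf.predict_proba(X_unlabeled).max(axis=1)

    # Indices of the 100 least confident predictions: label these ones next.
    to_label_next = np.argsort(confidence)[:100]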
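
Finally, an in-memory take on the Rocchio / k-NN idea: with tf-idf features, scikit-learn's NearestCentroid is essentially a Rocchio classifier and KNeighborsClassifier gives the k-NN variant. This won't scale to very large collections the way a Solr or Elasticsearch backed implementation does, but it is a quick baseline to compare against.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.pipeline import make_pipeline

    # Rocchio-style classifier: nearest centroid in tf-idf space (with
    # L2-normalized tf-idf vectors, euclidean distance to the centroids
    # ranks documents like cosine similarity does).
    rocchio = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                            NearestCentroid())
    rocchio.fit(labeled_texts, labels)

    # k-NN variant: majority vote among the 10 most similar labeled documents.
    knn = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                        KNeighborsClassifier(n_neighbors=10))
    knn.fit(labeled_texts, labels)

    print(rocchio.predict(["some new unlabeled document"]))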