2013/9/12 Ark <4rk....@gmail.com>:
>
> Upon moving to the machine learning approach, we realised the issue
> is misclassification or even duplication of categories in some cases.
> In order to get a rough estimate I was thinking of a clustering approach
> like k-means, however since the number of categories might be less than
> 3000, does this seem to be the correct approach? Or if there is a better
> solution, would certainly appreciate pointers.
There is no guarantee that the categories found by k-means will match the categories you are really interested in. For instance, consider a corpus of comments on a sports website covering 3 sports:
- football,
- tennis,
- cricket.

Assume you are interested in categorizing posts according to 3 sentiment categories:
- positive,
- neutral,
- negative.

If the positive and negative words are more frequent and discriminative than the sport-specific words, then k-means clustering with k=3 might help get you started. However it's very likely that the sport-specific words will be more discriminative and k-means will rather base its clustering on those.

In real life it's even more likely that there will be correlations between sentiments and sports (and even with other non-sport / non-sentiment topics), for instance if:
- tennis comments are more neutral on average than football comments,
- football is more negative than tennis and cricket,
- cricket is always positive.

In that case k-means with k=3 will learn clusters that mix and match sentiment-specific words with sport-specific words. Such spurious correlations are very likely to happen if your corpus has some categories with very few example documents. (The first sketch in the PS at the end of this message shows a quick way to check what the clusters actually picked up on.)

What you could do on the other hand is:
- get rid of all the categories where you have fewer than 50 documents (or start labeling new documents specifically for those categories, using a keyword search with a blend of very specific keywords for each category),
- train a first model on your existing corpus and then start an active learning loop: run the classifier on a batch of unlabeled documents and manually label the new documents where the classifier is the least certain (you can choose a classifier with a decision_function or a predict_proba method to find the least confident predictions; see the second sketch in the PS),
- it might help to add a "garbage" category for all the documents that don't match any of the categories you are really interested in. It might be required to help a multiclass classifier pick up less corpus-specific bias. You can initialize this category with random text from Wikipedia or the web.

If you plan to seriously increase the number of documents in your corpus you could also try a Rocchio classifier [1] or a k-NN classifier. For large text document collections it's probably more interesting to implement them on top of a search engine such as Solr or Elasticsearch with similarity queries (the last sketch in the PS shows a small in-memory scikit-learn variant to get started). Here is some toy code I wrote for a k-NN classifier on Wikipedia articles using Solr:

https://github.com/ogrisel/pignlproc/blob/master/examples/topic-corpus/categorize.py

[1] http://en.wikipedia.org/wiki/Rocchio_algorithm

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
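
PS: a few rough, untested sketches to make the suggestions above more concrete. Variable names such as texts, labeled_texts, labels and unlabeled_texts are placeholders for your own data, not anything from your setup.

First, a quick way to sanity-check what a k=3 k-means clustering on tf-idf features actually latched onto: inspect the highest-weighted terms of each centroid. If the top terms are sport names rather than sentiment words, the clusters are tracking the wrong axis.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # texts: a list of raw documents (placeholder, replace with your own data)
    vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english")
    X = vectorizer.fit_transform(texts)

    km = KMeans(n_clusters=3, random_state=0)
    km.fit(X)

    # Print the 10 highest-weighted terms of each centroid to see which
    # axis (sport vs sentiment) the clustering actually picked up on.
    terms = vectorizer.get_feature_names_out()  # get_feature_names() on old versions
    for i, centroid in enumerate(km.cluster_centers_):
        top = centroid.argsort()[::-1][:10]
        print("cluster %d: %s" % (i, ", ".join(terms[j] for j in top)))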
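
Second, the uncertainty-sampling step of the active learning loop: train on the labeled corpus, score a batch of unlabeled documents with predict_proba, and surface the least confident predictions for manual labeling. Any classifier with predict_proba (or decision_function) would do; logistic regression is just one choice.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    # labeled_texts, labels, unlabeled_texts: placeholders for your own data
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5)
    X_train = vectorizer.fit_transform(labeled_texts)
    clf = LogisticRegression().fit(X_train, labels)

    # predict_proba returns one probability per category; the lower the
    # maximum probability, the less confident the classifier is about
    # that document.
    X_unlabeled = vectorizer.transform(unlabeled_texts)
    confidence = clf.predict_proba(X_unlabeled).max(axis=1)

    # Indices of the 100 least confident predictions: label these ones next.
    to_label_next = np.argsort(confidence)[:100]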
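
Finally, an in-memory take on the Rocchio / k-NN idea: with tf-idf features, scikit-learn's NearestCentroid is essentially a Rocchio classifier and KNeighborsClassifier gives the k-NN variant. This won't scale to very large collections the way a Solr or Elasticsearch backed implementation does, but it is a quick baseline to compare against.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
    from sklearn.pipeline import make_pipeline

    # Rocchio-style classifier: nearest centroid in tf-idf space (with
    # L2-normalized tf-idf vectors, euclidean distance to the centroids
    # ranks documents like cosine similarity does).
    rocchio = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                            NearestCentroid())
    rocchio.fit(labeled_texts, labels)

    # k-NN variant: majority vote among the 10 most similar labeled documents.
    knn = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                        KNeighborsClassifier(n_neighbors=10))
    knn.fit(labeled_texts, labels)

    print(rocchio.predict(["some new unlabeled document"]))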