Hello Mike. Could you give a summary of your problem? It sounds like you're categorising text (tweets? medical text? news articles?) into >2 categories (how many?), is that right? And is the goal really to optimise your F1 score, or do you mainly want accurate categorisations (high precision), or as many of the true cases as possible (high recall)?
I'm working on a related problem: "brand disambiguation in social media". Basically I'm looking to disambiguate words like "apple" in tweets for brand analysis - I want all the Apple Inc mentions and none of the fruit, juice, band, pejorative, fashion etc. mentions. This is similar to spam classification (in-class vs out-of-class). The project is open (I've been on this task for 5 weeks as a break from consulting work); I collected and tagged 2000 tweets for this first iteration. Code and write-ups are here:

https://github.com/ianozsvald/social_media_brand_disambiguator
http://ianozsvald.com/category/socialmediabranddisambiguator/

A closely related question to my project was asked on StackOverflow; you might find the advice given by others (I also wrote a long answer) relevant:

http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141

With my tweets the messages are roughly of uniform length (so TF-IDF isn't so useful) and often have only one mention of a term (so binary term counts make more sense). Binomial NaiveBayes outperforms everything else (Logistic Regression comes second). Currently I don't do any cleaning or stemming; unigrams are 'ok' but 1-3 ngrams are best. I haven't yet experimented with Chi2 for feature selection (beyond just eyeballing the output to see what was important). I'm not yet adding any external information (e.g. no part-of-speech tags, no retrieving of titles of URLs in a tweet).

Using 5-fold cross validation I get pretty stable results. I'm optimising for precision and then recall (I want correct answers, and as many of them as possible). I test using a precision score and cross-entropy error (the latter is useful for fine tweaks but isn't comparable between classifiers - the probability distributions can be very different). On a held-out validation set the result is competitive with OpenCalais (a commercial alternative) on my training data; I don't yet know if it generalises well to other time periods (that's next on my list to test).

For diagnostics I've found a plain DecisionTree to be useful - in my trees it is obvious that two features are being overfitted. First, if 'http' appears then the tweet is much more likely to be classed as Apple Inc - this could be an artefact of the tweets in my sample period (lots of news stories). Secondly, there are Vatican/Pope references that count as evidence for Apple Inc - these tweets came from around the time of the new Pope announcement (e.g. jokes about "iPope" and "white smoke from Apple HQ"). These features pop out near the top of the Decision Tree and clearly look wrong. This might help with your diagnostics?

I'm also diagnosing the tweets that are misclassified - I invert them back through the Vectorizer to get the bag of words. Often the poorly classified tweets contain only a few terms, and those terms aren't very informative. This also might help with diagnosing your poor performance?

I'd be happy to hear more about your problem as it has obvious similarities to mine. Below are a few rough sketches of the techniques I mention above, in case they're useful.
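First, a minimal sketch of the binary-term-count setup with 1-3 ngrams and 5-fold cross validation scored on precision. This isn't my project code - the tweets and labels below are made-up placeholders, and I'm reading my "Binomial NaiveBayes" as scikit-learn's BernoulliNB:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation in older releases

# Toy stand-ins for my tagged corpus: 1 = Apple Inc, 0 = everything else.
tweets = [
    "apple unveils the new iphone at the keynote http://t.co/abc",
    "aapl shares up after apple announces record quarterly results",
    "my apple macbook arrived today and it is lovely",
    "apple maps got a big update in the latest ios release",
    "rumours of a cheaper apple tablet surface again",
    "fresh apple juice for sale at the farmers market",
    "grandma's apple pie recipe is the best dessert ever",
    "picked a basket of apples in the orchard this morning",
    "the big apple is wonderful in the autumn",
    "apple and cinnamon porridge for breakfast",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Binary term presence instead of TF-IDF, unigrams through trigrams.
pipeline = Pipeline([
    ("vec", CountVectorizer(binary=True, ngram_range=(1, 3), lowercase=True)),
    ("clf", BernoulliNB()),
])

# 5-fold cross validation, optimising for precision first (I want correct answers).
scores = cross_val_score(pipeline, tweets, labels, cv=5, scoring="precision")
print("precision per fold:", scores)
print("mean precision:", scores.mean())
```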
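For Chi2 feature selection (which, as I said, I haven't properly tried yet) the idea would be to slot SelectKBest between the vectorizer and the classifier, or just score the ngrams directly and eyeball the ranking. A sketch reusing the toy tweets/labels from the previous snippet; k is a made-up number you'd have to tune:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

# Option 1: keep only the k highest-scoring ngrams inside the pipeline
# (k must be <= the number of extracted ngrams).
selected_pipeline = Pipeline([
    ("vec", CountVectorizer(binary=True, ngram_range=(1, 3))),
    ("chi2", SelectKBest(chi2, k=50)),
    ("clf", BernoulliNB()),
])

# Option 2: score ngrams directly to see which ones carry the signal.
vec = CountVectorizer(binary=True, ngram_range=(1, 3))
X = vec.fit_transform(tweets)               # tweets/labels from the sketch above
chi2_scores, p_values = chi2(X, labels)
names = vec.get_feature_names_out()         # get_feature_names() in older scikit-learn
ranked = sorted(zip(names, chi2_scores), key=lambda pair: -pair[1])
for term, score in ranked[:20]:
    print(term, score)
```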
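The DecisionTree diagnostic is roughly this - fit a shallow tree on the same binary ngram matrix and look at which features it splits on near the top; that's how the 'http' and Vatican/Pope artefacts jumped out at me. Again only a sketch on the toy data above:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

vec = CountVectorizer(binary=True, ngram_range=(1, 3))
X = vec.fit_transform(tweets)                      # tweets/labels from the first sketch
tree = DecisionTreeClassifier(max_depth=5, random_state=0)
tree.fit(X.toarray(), labels)

# Rank the ngrams the tree actually used - suspicious features (like 'http'
# in my sample period) show up right at the top.
names = np.asarray(vec.get_feature_names_out())    # get_feature_names() in older scikit-learn
importance_order = np.argsort(tree.feature_importances_)[::-1]
for idx in importance_order[:10]:
    if tree.feature_importances_[idx] > 0:
        print(names[idx], round(tree.feature_importances_[idx], 3))
```

If you have graphviz installed, sklearn.tree.export_graphviz(tree, out_file="tree.dot", feature_names=names) lets you eyeball the whole tree rather than just the importance ranking.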
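And the misclassification diagnostic: invert the misclassified documents back through the Vectorizer so you can see exactly which terms the classifier had to work with (in my case the badly classified tweets usually only have a couple of uninformative terms left). A sketch using the pipeline from the first snippet - in practice you'd do this on a held-out fold rather than the training data, and on such a tiny toy set there may be nothing to print:

```python
# Reusing the pipeline, tweets and labels from the first sketch.
pipeline.fit(tweets, labels)
vec = pipeline.named_steps["vec"]
predictions = pipeline.predict(tweets)

for tweet, truth, predicted in zip(tweets, labels, predictions):
    if truth != predicted:
        # inverse_transform gives back the bag of ngrams the model actually saw
        terms = vec.inverse_transform(vec.transform([tweet]))[0]
        print("misclassified:", tweet)
        print("  terms:", sorted(terms))
```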
Ian.

On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
> I have been using Scikit's text classification for several weeks, and I
> really like it. I use my own corpus (self-generated) and prepare each
> document using the NLTK. Presently I am relying on this tutorial/code-base,
> only making changes when absolutely necessary for my documents to work.
>
> The problem: when working with smaller document sets (tens or one-hundred
> instead of hundreds or thousands) I am receiving really low success rates
> (the f1-score).
>
> Question: What are the areas/options I should be researching and attempting
> to implement to improve my categorization success rates?
>
> Additional information: using the NLTK, I am removing stopwords and stemming
> each word before the document is classified. Using Scikit, I am using an
> n-gram range of one to four (this made a significant difference when I had
> larger sample sets, but very little difference now that I am working on
> "sub-categories").
>
> Any help would be enormously appreciated. I am willing to research and try
> any technique recommended, so if you have one, please don't hesitate. Thank
> you very much.
>
> Warmest regards,
>
> Mike

--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com