Hello Mike. Could you give a summary of your problem? It sounds like
you're categorising text (tweets? medical text? news articles?) into
more than two categories (how many?), is that right? Is the goal
really to optimise your F1 score, or do you mainly want accurate
categorisations (high precision), or high recall?

I'm working on a related problem, "brand disambiguation in social
media" - basically I'm looking to disambiguate words like "apple" in
tweets for brand analysis: I want all the Apple Inc mentions and none
of the fruit, juice, band, pejorative, fashion etc. mentions. This is
similar to spam classification (in-class vs out-of-class). The
project is open (I've been on this task for 5 weeks as a break from
consulting work), and I collected and tagged 2000 tweets for this
first iteration. Code and write-ups here:
https://github.com/ianozsvald/social_media_brand_disambiguator
http://ianozsvald.com/category/socialmediabranddisambiguator/

A question very closely related to my project was asked on
StackOverflow; you might find the advice given by others (I also
wrote a long answer) relevant:
http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141

With my tweets the messages are roughly of uniform length (so TF-IDF
isn't so useful) and often contain only one mention of a term (so
binary term counts make more sense). Binomial Naive Bayes outperforms
everything else (Logistic Regression comes second). Currently I don't
do any cleaning or stemming; unigrams are 'ok' but 1-3 ngrams are
best.
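
To make that concrete, here is a minimal sketch of the representation
and the two classifiers (assuming scikit-learn's BernoulliNB is the
Naive Bayes variant meant above; `tweets` and `labels` are
placeholders for your own data):

    # Binary term presence with 1-3 grams, comparing Naive Bayes
    # against Logistic Regression. `tweets` is a list of strings,
    # `labels` a list/array of 0/1 class labels.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.linear_model import LogisticRegression

    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(tweets)  # sparse 0/1 term matrix

    for clf in (BernoulliNB(), LogisticRegression()):
        clf.fit(X, labels)
        # training accuracy only, just to sanity-check the setup
        print(clf.__class__.__name__, clf.score(X, labels))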

I haven't yet experimented with chi2 for feature selection (beyond
just eyeballing the output to see what was important). I'm not yet
adding any external information (e.g. no part-of-speech tags, no
retrieval of the titles of URLs in a tweet).
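
If you want to try chi2 feature selection, a sketch (continuing from
the snippet above, with a guessed k) might look like:

    # Keep the k terms most associated with the class labels, then
    # eyeball the highest-scoring ones.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    selector = SelectKBest(chi2, k=1000)
    X_reduced = selector.fit_transform(X, labels)

    # get_feature_names_out() in newer scikit-learn releases
    feature_names = np.asarray(vectorizer.get_feature_names())
    top = np.argsort(selector.scores_)[::-1][:20]
    print(feature_names[top])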

Using 5-fold cross validation I get pretty stable results. I'm
optimising for precision first and then recall (I want correct
answers, and as many of them as possible). I test using a precision
score and cross-entropy error (the latter is useful for fine tweaks
but isn't comparable between classifiers - the probability
distributions can be very different). On a held-out validation set
the result is competitive with OpenCalais (a commercial alternative)
on my training data; I don't yet know if it generalises well to other
time periods (that's next on my list to test).
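
In scikit-learn terms that evaluation is roughly as below (import
paths are for current releases; older versions kept these helpers in
sklearn.cross_validation):

    # 5-fold cross validation scored on precision, plus cross-entropy
    # (log loss) on a held-out split for fine tweaks.
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.metrics import log_loss

    clf = BernoulliNB()
    precisions = cross_val_score(clf, X, labels, cv=5, scoring='precision')
    print("precision: %0.3f +/- %0.3f" % (precisions.mean(), precisions.std()))

    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    clf.fit(X_train, y_train)
    print("log loss:", log_loss(y_test, clf.predict_proba(X_test)))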

For diagnostics I've found a plain DecisionTree to be useful - it is
obvious in my trees that two features are being overfitted. First if
'http' appears then it is much more likely to be an Apple Inc tweet -
this could be an artefact of the tweets in my sample period (lots of
news stories). Secondly there are Vatican/Pope references that give
evidence to Apple Inc - these tweets came from around the time of the
new Pope announcement (e.g. jokes about "iPope" and "white smoke from
Apple HQ"). These features pop out near the top of the Decision Tree
and clearly look wrong. This might help with your diagnostics?
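
A sketch of that diagnostic (a shallow tree plus its most important
features; the max_depth of 5 is just a guess):

    # Fit a plain decision tree and list the features it leans on
    # most, to spot suspicious terms near the top (e.g. 'http').
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(max_depth=5)
    # newer scikit-learn accepts the sparse matrix directly; on older
    # versions densify first with X.toarray()
    tree.fit(X, labels)

    feature_names = np.asarray(vectorizer.get_feature_names())
    top = np.argsort(tree.feature_importances_)[::-1][:20]
    for name, importance in zip(feature_names[top], tree.feature_importances_[top]):
        print("%-25s %.3f" % (name, importance))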

I'm also diagnosing the tweets that are misclassified - I invert them
through the Vectorizer to get the bag of words back. Often the
misclassified tweets contain only a few terms, and those terms are
not informative. This might also help with diagnosing your poor
performance?
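
Roughly (reusing `clf`, `vectorizer` and the held-out split from the
sketches above):

    # Invert the misclassified rows back through the vectorizer to
    # see which bag-of-words terms survived vectorisation.
    import numpy as np

    predicted = clf.predict(X_test)
    wrong = np.where(predicted != np.asarray(y_test))[0]
    for terms in vectorizer.inverse_transform(X_test[wrong]):
        print(sorted(terms))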

I'd be happy to hear more about your problem as it has obvious
similarities to mine,
Ian.

On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
> I have been using Scikit's text classification for several weeks, and I
> really like it.  I use my own corpus (self-generated) and prepare each
> document using the NLTK.  Presently I am relying on this tutorial/code-base,
> only making changes when absolutely necessary for my documents to work.
>
> The problem: when working with smaller document sets (tens or one-hundred
> instead of hundreds or thousands) I am receiving really low success rates
> (the f1-score).
>
> Question: What are the areas/options I should be researching and attempting
> to implement to improve my categorization success rates?
>
> Additional information: using the NLTK, I am removing stopwords and stemming
> each word before the document is classified.  Using Scikit, I am using an
> n-gram range of one to four (this made a significant difference when I had
> larger sample sets, but very little difference now that I am working on
> "sub-categories").
>
> Any help would be enormously appreciated.  I am willing to research and try
> any technique recommended, so if you have one, please don't hesitate.  Thank
> you very much.
>
> Warmest regards,
>
> Mike
>



-- 
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com
