Hi Ian,

Thank you very much for writing this message, and especially for
sharing your experience. I am actually
doing the very same thing, and would love to collaborate with you,
if possible. I'm not as far along in my journey as you are, but I hope
we can help each other in the future!

I'm classifying tweets, Facebook comments, YouTube comments, Google+
comments, etc. into more than two categories.

The biggest problem is not having enough labeled data. As others have
mentioned, it seems like a good idea to spend resources on obtaining
more labeled data, but I also like the idea of "automatic text
classification" performed in a semi-supervised way.

The idea is that, starting from a small amount of labeled data, one can
cluster the unlabeled documents to expand the number of examples in
each category. I haven't gotten far with this yet, but once I have, I'd
like to circle back if I learn anything interesting.
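
For concreteness, here is a minimal sketch of one flavour of that idea
(self-training rather than clustering, with toy documents and a made-up
confidence threshold, just to illustrate the loop):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: -1 marks the unlabeled documents.
texts = ["cheap pills buy now", "meeting at noon tomorrow",
         "buy cheap meds now", "lunch meeting tomorrow?"]
labels = np.array([0, 1, -1, -1])

X = TfidfVectorizer().fit_transform(texts)
clf = MultinomialNB()

for _ in range(3):                  # a few self-training rounds
    known = labels != -1
    if known.all():
        break
    clf.fit(X[known], labels[known])
    proba = clf.predict_proba(X[~known])
    sure = proba.max(axis=1) > 0.8  # only trust confident guesses
    if not sure.any():
        break
    idx = np.flatnonzero(~known)[sure]
    labels[idx] = clf.classes_[proba[sure].argmax(axis=1)]

print(labels)                       # the expanded label set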

Thanks and best wishes,

Harold


On Thu, Jul 11, 2013 at 2:43 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:

> Hello Mike. Could you give a summary of your problem? It sounds like
> you're categorising text (tweets? medical text? news articles?) into
> >2 categories (how many?), is that right? And is the goal really to
> optimise your f1 score, or do you mainly want accurate categorisations
> (precision), or perhaps high recall?
>
> I'm working on a related problem for "brand disambiguation in social
> media" - basically I'm looking to disambiguate words like "apple" in
> tweets for brand analysis, I want all the Apple Inc mentions and none
> for fruit, juice, bands, pejorative, fashion etc. This is similar to
> spam classification (in-class, out-of-class). The project is open
> (I've been on this task for 5 weeks as a break from consulting work),
> I collected and tagged 2000 tweets for this first iteration. Code and
> write-ups here:
> https://github.com/ianozsvald/social_media_brand_disambiguator
> http://ianozsvald.com/category/socialmediabranddisambiguator/
>
> A very related question to my project was asked on StackOverflow, you
> might find the advice given by others (I also wrote a long answer)
> relevant:
>
> http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141
>
> With my tweets the messages are roughly of uniform length (so TF-IDF
> isn't so useful) and often have only 1 mention of a term (so binary
> term counts make more sense). Bernoulli Naive Bayes (BernoulliNB)
> outperforms everything else (Logistic Regression comes second).
> Currently I don't do any cleaning or stemming; unigrams are 'ok' but
> 1-3 n-grams are best.
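>
> In scikit-learn terms that setup is roughly the following (a sketch;
> train_texts and train_labels stand in for my tagged tweets):
>
> from sklearn.feature_extraction.text import CountVectorizer
> from sklearn.naive_bayes import BernoulliNB
> from sklearn.pipeline import Pipeline
>
> clf = Pipeline([
>     ("vec", CountVectorizer(binary=True, ngram_range=(1, 3))),
>     ("nb", BernoulliNB()),
> ])
> clf.fit(train_texts, train_labels)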
>
> I haven't experimented yet with Chi2 for feature selection (beyond
> just eyeballing the output to see what was important). I'm not yet
> adding any external information (e.g. no part-of-speech tags, no
> retrieval of the titles of URLs in a tweet).
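>
> (When I do try it, it should only be a couple of lines, something like
> this - k here is just a guess to tune:)
>
> from sklearn.feature_selection import SelectKBest, chi2
>
> select = SelectKBest(chi2, k=1000)
> X_reduced = select.fit_transform(X_counts, y)  # counts from the vectorizer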
>
> Using 5-fold cross validation I get pretty stable results; I'm
> optimising for precision and then recall (I want correct answers, and
> as many as possible). I test using a precision score and cross-entropy
> error (the latter is useful for fine tweaks but isn't comparable
> between classifiers - the probability distributions can be very
> different). The result is competitive with OpenCalais (a commercial
> alternative) on my training sets using a held-out validation set; I
> don't yet know if it generalises well to other time periods (that's
> next on my list to test).
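>
> With a recent scikit-learn the scoring part is short (a sketch; X and
> y are the vectorised tweets and their tags, X_val/y_val the held-out
> validation set):
>
> from sklearn.cross_validation import cross_val_score
> from sklearn.metrics import log_loss
> from sklearn.naive_bayes import BernoulliNB
>
> clf = BernoulliNB()
> print(cross_val_score(clf, X, y, cv=5, scoring="precision"))
> print(log_loss(y_val, clf.fit(X, y).predict_proba(X_val)))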
>
> For diagnostics I've found a plain DecisionTree to be useful - it is
> obvious in my trees that two features are being overfitted. First, if
> 'http' appears then the message is much more likely to be an Apple Inc
> tweet - this could be an artefact of the tweets in my sample period
> (lots of news stories). Secondly, there are Vatican/Pope references
> that count as evidence for Apple Inc - these tweets came from around
> the time of the new Pope announcement (e.g. jokes about "iPope" and
> "white smoke from Apple HQ"). These features pop out near the top of
> the Decision Tree and clearly look wrong. This might help with your
> diagnostics?
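>
> The tree check itself is short (a sketch; X, y as above, vec is the
> fitted vectorizer and the depth is arbitrary):
>
> from sklearn.tree import DecisionTreeClassifier
>
> tree = DecisionTreeClassifier(max_depth=5).fit(X.toarray(), y)
> names = vec.get_feature_names()
> for i in tree.feature_importances_.argsort()[::-1][:10]:
>     print(names[i], tree.feature_importances_[i])  # 'http' pops out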
>
> I'm also diagnosing the tweets that are misclassified - I invert them
> through the Vectorizer to get the bag of words back. Often the
> misclassified tweets contain only a few terms, and those terms are not
> informative. This might also help with diagnosing your poor
> performance?
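>
> In code that is just (a sketch; X_test/y_test are a held-out split and
> vec the fitted vectorizer, as above):
>
> wrong = clf.predict(X_test) != y_test
> for bag in vec.inverse_transform(X_test[wrong]):
>     print(bag)  # the terms the model actually saw for each miss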
>
> I'd be happy to hear more about your problem as it has obvious
> similarities to mine,
> Ian.
>
> On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
> > I have been using Scikit's text classification for several weeks, and
> > I really like it.  I use my own corpus (self-generated) and prepare
> > each document using the NLTK.  Presently I am relying on this
> > tutorial/code-base, only making changes when absolutely necessary for
> > my documents to work.
> >
> > The problem: when working with smaller document sets (tens or one
> > hundred instead of hundreds or thousands) I am receiving really low
> > success rates (the f1-score).
> >
> > Question: What are the areas/options I should be researching and
> > attempting to implement to improve my categorization success rates?
> >
> > Additional information: using the NLTK, I am removing stopwords and
> > stemming each word before the document is classified.  Using Scikit,
> > I am using an n-gram range of one to four (this made a significant
> > difference when I had larger sample sets, but very little difference
> > now that I am working on "sub-categories").
> >
> > Any help would be enormously appreciated.  I am willing to research
> > and try any technique recommended, so if you have one, please don't
> > hesitate.  Thank you very much.
> >
> > Warmest regards,
> >
> > Mike
> >
> >
>
> --
> Ian Ozsvald (A.I. researcher)
> i...@ianozsvald.com
>
> http://IanOzsvald.com
> http://MorConsulting.com/
> http://Annotate.IO
> http://SocialTiesApp.com/
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com/
> http://twitter.com/IanOzsvald
> http://ShowMeDo.com
>
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
