Hi Harold. Are you using different models for the different types of social media? I'd guess that the grammar and terms used in a tweet could look quite different from what you'd see in e.g. a Google+ comment (a different demographic probably means higher-quality English, and fewer space restrictions make longer, clearer wording more likely). I'd also guess that Facebook and Google+ text might be similar, that YouTube comments (with their wide demographic and varying quality of English) are specific to reactions to a video, and that tweets are short and so pack a lot of context into few characters.
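If you do split by platform, a minimal sketch of per-source models might look like the following (texts_by_source and labels_by_source are hypothetical names, and the binary 1-3 gram / Bernoulli Naive Bayes choices just mirror what worked for me in the thread quoted below):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    # one independent vectorizer+classifier pair per platform, since the
    # text register differs between Twitter, Facebook, YouTube etc.
    models = {}
    for source, texts in texts_by_source.items():
        pipe = make_pipeline(
            CountVectorizer(binary=True, ngram_range=(1, 3)),
            BernoulliNB())
        pipe.fit(texts, labels_by_source[source])
        models[source] = pipe

    # route each new message through its own platform's model, e.g.:
    # models["twitter"].predict(["loving the new iphone"])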
Re. labelled data - yes, get lots of it. More data probably beats all the other tweaks you could come up with. I have millions of tweets here for 'apple', and I'm about to experiment with building models based on rule-derived classifications (e.g. if 'iphone' appears then is-Apple-Inc; if 'apple juice' appears then is-not-Apple-Inc) - a minimal sketch of this appears at the end of this mail. This, possibly combined with some semi-supervised classification and human intervention, could be a powerful mechanism. The alternative is to employ an intern, Mechanical Turk or CrowdFlower to get cheap human-hours behind the classification task (e.g. using 3 human votes per message to average out mistakes and foolishness). I've looked into this before but haven't yet tried it in anger.
Ian.

On 11 July 2013 17:34, Harold Nguyen <har...@nexgate.com> wrote:
> Hi Ian,
>
> Thank you very much for writing this message, and especially for sharing your experience. I am actually doing the very same thing, and would love to collaborate with you if possible. I'm not as far along in my journey as you are, but I hope we can help each other in the future!
>
> I'm categorizing (>2 categories) tweets, Facebook comments, YouTube comments, Google+ comments, etc.
>
> The biggest problem is not having enough labeled data. As others have mentioned, it seems like a great idea to spend resources on obtaining this labeled data, but I also like the idea of "automatic text classification", performed in a semi-supervised way.
>
> The idea is that, given a small amount of labeled data, one can cluster unlabeled documents to expand the amount of data in each category. I haven't gotten far with this yet, but once I have, I'd like to circle back with you if I learn anything interesting.
>
> Thanks and best wishes,
>
> Harold
>
> On Thu, Jul 11, 2013 at 2:43 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> Hello Mike. Could you give a summary of your problem? It sounds like you're categorising text (tweets? medical text? news articles?) into >2 categories (how many?), is that right? And is the goal really to optimise your f1 score, or do you want accurate categorisations (precision), or perhaps high recall?
>>
>> I'm working on a related problem, "brand disambiguation in social media" - basically I'm looking to disambiguate words like "apple" in tweets for brand analysis: I want all the Apple Inc mentions and none of the mentions of fruit, juice, bands, pejoratives, fashion etc. This is similar to spam classification (in-class vs. out-of-class). The project is open (I've been on this task for 5 weeks as a break from consulting work), and I collected and tagged 2000 tweets for this first iteration. Code and write-ups here:
>> https://github.com/ianozsvald/social_media_brand_disambiguator
>> http://ianozsvald.com/category/socialmediabranddisambiguator/
>>
>> A closely related question was asked on StackOverflow; you might find the advice given by others (I also wrote a long answer) relevant:
>> http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141
>>
>> With my tweets the messages are of roughly uniform length (so TF-IDF isn't so useful) and often contain only one mention of a term (so binary term counts make more sense). Bernoulli Naive Bayes outperforms everything else (Logistic Regression comes second). Currently I don't do any cleaning or stemming; unigrams are 'ok' but 1-3 ngrams are best (a sketch of this pipeline appears at the end of this mail).
>>
>> I haven't yet experimented with chi2 for feature selection (beyond just eyeballing the output to see what was important). I'm not yet adding any external information (e.g. no part-of-speech tags, no retrieving of the titles of URLs in a tweet).
>>
>> Using 5-fold cross validation I get pretty stable results. I'm optimising for precision first and recall second (I want correct answers, and as many of them as possible). I test using a precision score and cross-entropy error (the latter is useful for fine tweaks but isn't comparable between classifiers - the probability distributions can be very different). The result is competitive with OpenCalais (a commercial alternative) on my training sets using a held-out validation set; I don't yet know if it generalises well to other time periods (that's next on my list to test).
>>
>> For diagnostics I've found a plain DecisionTree to be useful - it is obvious in my trees that two features are being overfit. First, if 'http' appears then the tweet is much more likely to be classed as Apple Inc - this could be an artefact of the tweets in my sample period (lots of news stories). Secondly, there are Vatican/Pope references that give evidence for Apple Inc - these tweets came from around the time of the new Pope's announcement (e.g. jokes about "iPope" and "white smoke from Apple HQ"). These features pop out near the top of the decision tree and clearly look wrong. This might help with your diagnostics?
>>
>> I'm also diagnosing the tweets that are misclassified - I invert them through the Vectorizer to get the bag of words back (a sketch of both diagnostics appears at the end of this mail). Often the misclassified tweets contain only a few terms, and those terms are uninformative. This might also help with diagnosing your poor performance?
>>
>> I'd be happy to hear more about your problem as it has obvious similarities to mine,
>> Ian.
>>
>> On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
>> > I have been using Scikit's text classification for several weeks, and I really like it. I use my own corpus (self-generated) and prepare each document using the NLTK. Presently I am relying on this tutorial/code-base, only making changes when absolutely necessary for my documents to work.
>> >
>> > The problem: when working with smaller document sets (tens or one hundred instead of hundreds or thousands), I am getting really low success rates (as measured by the f1-score).
>> >
>> > Question: what areas/options should I research and attempt to implement to improve my categorization success rates?
>> >
>> > Additional information: using the NLTK, I remove stopwords and stem each word before the document is classified. Using Scikit, I use an n-gram range of one to four (this made a significant difference when I had larger sample sets, but very little difference now that I am working on "sub-categories").
>> >
>> > Any help would be enormously appreciated. I am willing to research and try any technique recommended, so if you have one, please don't hesitate. Thank you very much.
>> >
>> > Warmest regards,
>> >
>> > Mike
--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com
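P.S. Here are minimal sketches of the ideas above. First, the rule-derived labelling: hand-written keyword rules assign noisy labels to unlabelled tweets, and messages the rules can't decide are left for the semi-supervised step or human voters. The rules and the unlabelled_tweets variable are illustrative placeholders, not code from my repository:

    def rule_label(tweet):
        """Return 1 for is-Apple-Inc, 0 for is-not-Apple-Inc, None to abstain."""
        t = tweet.lower()
        if "iphone" in t or "ipad" in t or "macbook" in t:
            return 1
        if "apple juice" in t or "apple pie" in t:
            return 0
        return None  # undecided - leave for semi-supervised step / human votes

    # keep only the tweets a rule could decide; these seed the first model
    seed = [(t, rule_label(t)) for t in unlabelled_tweets]
    seed = [(t, y) for t, y in seed if y is not None]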
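Second, the binary-term-count pipeline with 5-fold cross validation scored on precision, as described in the quoted mail. tweets (a list of strings) and labels (1 = Apple Inc, 0 = not) are placeholders, and this uses the current scikit-learn API:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(
        CountVectorizer(binary=True, ngram_range=(1, 3)),  # binary presence of 1-3 grams
        BernoulliNB())

    # 5-fold cross validation, optimising what I care about first: precision
    scores = cross_val_score(pipe, tweets, labels, cv=5, scoring="precision")
    print("precision: %.3f +/- %.3f" % (scores.mean(), scores.std()))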
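Third, the two diagnostics: a shallow decision tree to spot features that are being overfit (like 'http' and the Pope jokes), and inverting misclassified tweets through the vectorizer to recover their bag of words. Again, tweets and labels are the same placeholders as above:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.tree import DecisionTreeClassifier

    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(tweets)
    y = np.asarray(labels)

    # 1) shallow tree: suspicious features rise to the top of the importances
    tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
    names = vectorizer.get_feature_names_out()
    top = np.argsort(tree.feature_importances_)[::-1][:10]
    print([names[i] for i in top])

    # 2) invert the misclassified tweets back into their terms
    clf = BernoulliNB().fit(X, y)
    mis = np.where(clf.predict(X) != y)[0]
    for terms in vectorizer.inverse_transform(X[mis]):
        print(sorted(terms))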