Hi Nigel. I see you're in the UK; I'm based east of you in London. My goal with the disambiguator is to provide a well-documented pipeline so that it can easily be retrained.
I have a notion that in the future I'll host a production-ready version of my code at http://annotate.io/, ready for client use (e.g. aimed at non-ML folk who want to do accurate brand disambiguation). The code will be based on my open source project (I'm really keen to see that used e.g. in academia for studies that are more interesting than just brand analytics... could someone passively monitor air quality via tweets from people talking about 'choking', 'wheezy', 'trouble breathing', without confusing those terms with other human behaviours?). If a commercial collaboration pops up, I consult (my AI/NLP consultancy is 7 years old), but I'll likely continue blogging my progress, so feel free to use whatever looks useful :-) I'll also be talking about this project at EuroSciPy next month.

On 12 July 2013 07:44, Nigel Legg <nigel.l...@gmail.com> wrote:
> I am just starting down the road towards having a text classifier for
> social media posts. As this may be used in a variety of situations
> (currently negotiating 2 freelance analytics positions with research
> agencies), the classifier will need a mechanism for retraining on a
> project-by-project basis, and thus human input each time I take on a new
> assignment - partly because the research aims of each assignment will be
> different, so there will be a need for customised categories. I would be
> interested in any method you discover / develop for using clustering to
> expand the training set, and any learning that comes with it.
>
> Regards,
> Nigel Legg
> 07914 740972
> http://www.trevanianlegg.co.uk
> http://twitter.com/nigellegg
> http://uk.linkedin.com/in/nigellegg
>
>
> On 11 July 2013 17:34, Harold Nguyen <har...@nexgate.com> wrote:
>>
>> Hi Ian,
>>
>> Thank you very much for writing this message, and especially for
>> sharing your experience. I am actually doing the very same thing, and
>> would love to collaborate with you, if possible.
>> I'm not as far along in my journey as you are, but I hope we can help
>> each other in the future!
>>
>> I'm categorizing (>2 categories) tweets, facebook comments, youtube
>> comments, google+ comments, etc.
>>
>> The biggest problem is not having enough labeled data. As others have
>> mentioned, it seems like a great idea to spend resources on obtaining
>> this labeled data, but I also like the idea of "automatic text
>> classification", performed in a semi-supervised way.
>>
>> The idea is that, starting from a few labeled documents, one can
>> cluster unlabeled documents to expand the amount of data in each
>> category. I haven't gotten far with this yet, but once I have, I would
>> like to circle back with you if I learn anything interesting.
>>
>> Thanks and best wishes,
>>
>> Harold
>>
>>
>> On Thu, Jul 11, 2013 at 2:43 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>>
>>> Hello Mike. Could you give a summary of your problem? It sounds like
>>> you're categorising text (tweets? medical text? news articles?) into
>>> >2 categories (how many?), is that right? Is the goal really to
>>> optimise your f1 score, or do you mainly want accurate
>>> categorisations (precision), or high recall?
>>>
>>> I'm working on a related problem: "brand disambiguation in social
>>> media". Basically I'm looking to disambiguate words like "apple" in
>>> tweets for brand analysis - I want all the Apple Inc mentions and
>>> none for fruit, juice, bands, pejoratives, fashion etc. This is
>>> similar to spam classification (in-class vs out-of-class). The
>>> project is open (I've been on this task for 5 weeks as a break from
>>> consulting work); I collected and tagged 2000 tweets for this first
>>> iteration.
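The semi-supervised expansion Harold describes - starting from a handful of labeled documents and propagating labels across similar unlabeled ones - could be sketched with scikit-learn's LabelSpreading over a TF-IDF representation. The corpus and labels below are invented purely for illustration:

```python
# Sketch of semi-supervised label expansion: a few labeled documents,
# labels propagated to unlabeled neighbours via LabelSpreading.
# The four documents and their labels are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

docs = [
    "new apple iphone announced today",             # labeled 1 (Apple Inc)
    "apple pie with fresh cream",                   # labeled 0 (fruit)
    "apple shares rise after iphone announcement",  # unlabeled
    "fresh apple pie recipe",                       # unlabeled
]
labels = [1, 0, -1, -1]  # -1 marks unlabeled documents

X = TfidfVectorizer().fit_transform(docs).toarray()  # dense for LabelSpreading
model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, labels)
print(model.transduction_)  # labels inferred for all four documents
```

The newly labeled documents could then seed a larger supervised training set, though on real data the propagated labels would want spot-checking by a human first.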
>>> Code and write-ups here:
>>> https://github.com/ianozsvald/social_media_brand_disambiguator
>>> http://ianozsvald.com/category/socialmediabranddisambiguator/
>>>
>>> A question closely related to my project was asked on StackOverflow;
>>> you might find the advice given by others (I also wrote a long
>>> answer) relevant:
>>>
>>> http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141
>>>
>>> With my tweets the messages are roughly of uniform length (so TF-IDF
>>> isn't so useful) and often have only 1 mention of a term (so binary
>>> term counts make more sense). Binomial NaiveBayes outperforms
>>> everything else (Logistic Regression comes second). Currently I don't
>>> do any cleaning or stemming; unigrams are 'ok' but 1-3 ngrams are
>>> best.
>>>
>>> I haven't experimented yet with Chi2 for feature selection (beyond
>>> just eyeballing the output to see what was important). I'm not yet
>>> adding any external information (e.g. no part-of-speech tags, no
>>> retrieving of titles of URLs in a tweet).
>>>
>>> Using 5-fold cross validation I get pretty stable results; I'm
>>> optimising for precision and then recall (I want correct answers, and
>>> as many as possible). I test using a precision score and cross-entropy
>>> error (the latter is useful for fine tweaks but isn't comparable
>>> between classifiers - the probability distributions can be very
>>> different). The result is competitive with OpenCalais (a commercial
>>> alternative) on my training sets using a held-out validation set; I
>>> don't yet know if it generalises well to other time periods (that's
>>> next on my list to test).
>>>
>>> For diagnostics I've found a plain DecisionTree to be useful - it is
>>> obvious in my trees that two features are being overfitted.
>>> First, if 'http' appears then it is much more likely to be an Apple
>>> Inc tweet - this could be an artefact of the tweets in my sample
>>> period (lots of news stories). Secondly, there are Vatican/Pope
>>> references that give evidence for Apple Inc - these tweets came from
>>> around the time of the new Pope announcement (e.g. jokes about
>>> "iPope" and "white smoke from Apple HQ"). These features pop out near
>>> the top of the Decision Tree and clearly look wrong. This might help
>>> with your diagnostics?
>>>
>>> I'm also diagnosing the tweets that are misclassified - I invert them
>>> through the Vectorizer to get the bag of words back. Often the poorly
>>> classified tweets contain only a few terms, none of them informative.
>>> This also might help with diagnosing your poor performance?
>>>
>>> I'd be happy to hear more about your problem as it has obvious
>>> similarities to mine,
>>> Ian.
>>>
>>> On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
>>> > I have been using Scikit's text classification for several weeks,
>>> > and I really like it. I use my own corpus (self-generated) and
>>> > prepare each document using the NLTK. Presently I am relying on
>>> > this tutorial/code-base, only making changes when absolutely
>>> > necessary for my documents to work.
>>> >
>>> > The problem: when working with smaller document sets (tens or one
>>> > hundred instead of hundreds or thousands) I am getting really low
>>> > success rates (the f1-score).
>>> >
>>> > Question: what are the areas/options I should be researching and
>>> > attempting to implement to improve my categorization success rates?
>>> >
>>> > Additional information: using the NLTK, I am removing stopwords and
>>> > stemming each word before the document is classified.
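For comparison with Mike's setup, the configuration Ian describes above (binary term counts over 1-3 n-grams feeding Naive Bayes, scored with 5-fold cross-validation on precision) might be sketched in scikit-learn roughly as below. The ten-document corpus is invented, and BernoulliNB is assumed to be the "Binomial NaiveBayes" Ian mentions:

```python
# Rough sketch of the pipeline Ian describes: binary term counts,
# 1-3 n-grams, Bernoulli Naive Bayes, 5-fold CV scored on precision.
# The tiny corpus is invented purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline

docs = [
    "apple iphone launch", "apple iphone sale", "apple iphone review",
    "apple iphone update", "apple iphone event",      # Apple Inc tweets
    "apple pie recipe", "apple pie baking", "apple pie dessert",
    "apple pie slice", "apple pie crust",             # fruit tweets
]
y = [1] * 5 + [0] * 5  # 1 = Apple Inc, 0 = fruit

clf = Pipeline([
    ("vec", CountVectorizer(binary=True, ngram_range=(1, 3))),
    ("nb", BernoulliNB()),
])
scores = cross_val_score(clf, docs, y, cv=5, scoring="precision")
print("mean precision:", scores.mean())
```

Swapping BernoulliNB for LogisticRegression in the same Pipeline gives the runner-up Ian mentions, and the Chi2 feature selection he has yet to try could be slotted in as a SelectKBest step between the vectoriser and the classifier.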
>>> > Using Scikit, I am using an n-gram range of one to four (this made
>>> > a significant difference when I had larger sample sets, but makes
>>> > very little difference now that I am working on "sub-categories").
>>> >
>>> > Any help would be enormously appreciated. I am willing to research
>>> > and try any technique recommended, so if you have one, please don't
>>> > hesitate. Thank you very much.
>>> >
>>> > Warmest regards,
>>> >
>>> > Mike
>>> >
>>> > ------------------------------------------------------------------------------
>>> > See everything from the browser to the database with AppDynamics
>>> > Get end-to-end visibility with application monitoring from AppDynamics
>>> > Isolate bottlenecks and diagnose root cause in seconds.
>>> > Start your free trial of AppDynamics Pro today!
>>> > http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
>>> > _______________________________________________
>>> > Scikit-learn-general mailing list
>>> > Scikit-learn-general@lists.sourceforge.net
>>> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>>
>>> --
>>> Ian Ozsvald (A.I. researcher)
>>> i...@ianozsvald.com
>>>
>>> http://IanOzsvald.com
>>> http://MorConsulting.com/
>>> http://Annotate.IO
>>> http://SocialTiesApp.com/
>>> http://TheScreencastingHandbook.com
>>> http://FivePoundApp.com/
>>> http://twitter.com/IanOzsvald
>>> http://ShowMeDo.com

--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com