I'm coming at this from a market research point of view (that's my
background). There seem to be a number of opportunities in that space
for classification, clustering, and regression analysis tools, so I am
building - or rather attempting to build - tools with the aim that they
will go on the web, where people will be able to use them themselves.
Following the "nowcasting flu" work from the University of Bristol (see
http://geopatterns.enm.bris.ac.uk/epidemics/twitter-flu-archive.php?region=EngW )
I have been interested in predictive analytics for some time, and this
seems to be my opportunity.
I will let you know how things develop.
Cheers, Nigel
Regards,
Nigel Legg
07914 740972
http://www.trevanianlegg.co.uk
http://twitter.com/nigellegg
http://uk.linkedin.com/in/nigellegg
On 12 July 2013 12:14, Ian Ozsvald <i...@ianozsvald.com> wrote:
> Hi Nigel. I see you're in the UK; I'm based east of you in London. My
> goal with the disambiguator is to provide a well-documented pipeline
> such that it can be easily retrained.
>
> I have a notion that in the future I'll host a production-ready version
> of my code under http://annotate.io/ , ready for client use (e.g. aimed
> at non-ML folk who want to do accurate brand disambiguation). The code
> will be based on my open source project (I'm really keen to see that
> used, e.g. in academia, for studies that are more interesting than just
> brand analytics...could someone monitor air quality passively via
> tweets from people talking about 'choking', 'wheezy', 'trouble
> breathing' without confusing those terms with other human behaviours?)
>
> If a commercial collaboration pops up, I consult (my AI/NLP consultancy
> is 7 years old), but I'll likely continue blogging my progress, so feel
> free to use whatever looks useful :-) I'll also be talking about this
> project at EuroSciPy next month.
>
> i.
>
> On 12 July 2013 07:44, Nigel Legg <nigel.l...@gmail.com> wrote:
> > I am just starting down the road towards having a text classifier for
> > social media posts. As this may be used in a variety of situations
> > (currently negotiating 2 freelance analytics positions with research
> > agencies), the classifier will need a mechanism for retraining on a
> > project-by-project basis, requiring human input each time I take on a
> > new assignment - partly because the research aims of each assignment
> > will be different, so there will be a need for customised categories.
> > I would be interested in any method you discover / develop for using
> > clustering to expand the training set, and any learning that comes
> > with it.
> >
> > Regards,
> > Nigel Legg
> > 07914 740972
> > http://www.trevanianlegg.co.uk
> > http://twitter.com/nigellegg
> > http://uk.linkedin.com/in/nigellegg
> >
> >
> >
> > On 11 July 2013 17:34, Harold Nguyen <har...@nexgate.com> wrote:
> >>
> >> Hi Ian,
> >>
> >> Thank you very much for writing this message, and especially for
> >> sharing your experience. I am actually doing the very same thing and
> >> would love to collaborate with you, if possible. I'm not as far along
> >> in my journey as you are, but I hope we can help each other in the
> >> future!
> >>
> >> I'm categorizing (>2 categories) tweets, Facebook comments, YouTube
> >> comments, Google+ comments, etc.
> >>
> >> The biggest problem is not having enough labeled data. As others have
> >> mentioned, it seems like a great idea to spend resources on obtaining
> >> this labeled data, but I also like the idea of "automatic text
> >> classification," performed in a semi-supervised way.
> >>
> >> The idea is that, starting from a small amount of labeled data, one
> >> can cluster the unlabeled documents to expand the amount of data in
> >> each category. I haven't gotten far with this yet, but once I have, I
> >> would like to circle back with you if I learn anything interesting.
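> >>
> >> Concretely, the cluster-then-label step might look something like the
> >> sketch below (just an illustration using scikit-learn's
> >> TfidfVectorizer and KMeans; the variable names, the number of
> >> clusters, and the label-propagation rule are all placeholder choices,
> >> not settled code):
> >>
> >>     import numpy as np
> >>     from sklearn.feature_extraction.text import TfidfVectorizer
> >>     from sklearn.cluster import KMeans
> >>
> >>     # docs: all documents (labeled + unlabeled)
> >>     # seed_labels: {doc_index: label} for the few hand-labeled documents
> >>     vec = TfidfVectorizer(min_df=2)
> >>     X = vec.fit_transform(docs)
> >>
> >>     km = KMeans(n_clusters=20, random_state=0)
> >>     clusters = km.fit_predict(X)
> >>
> >>     # propagate each seed label to the unlabeled documents that share
> >>     # its cluster (worth eyeballing the clusters before trusting this)
> >>     expanded = dict(seed_labels)
> >>     for idx, label in seed_labels.items():
> >>         for member in np.where(clusters == clusters[idx])[0]:
> >>             expanded.setdefault(int(member), label)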
> >>
> >> Thanks and best wishes,
> >>
> >> Harold
> >>
> >>
> >> On Thu, Jul 11, 2013 at 2:43 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:
> >>>
> >>> Hello Mike. Could you give a summary of your problem? It sounds like
> >>> you're categorising text (tweets? medical text? news articles?) into
> >>> >2 categories (how many?) - is that right? And is the goal really to
> >>> optimise your f1 score, or do you mainly want accurate
> >>> categorisations (precision), or perhaps high recall?
> >>>
> >>> I'm working on a related problem, "brand disambiguation in social
> >>> media": basically I'm looking to disambiguate words like "apple" in
> >>> tweets for brand analysis - I want all the Apple Inc mentions and
> >>> none of the fruit, juice, band, pejorative, or fashion senses. This
> >>> is similar to spam classification (in-class, out-of-class). The
> >>> project is open (I've been on this task for 5 weeks as a break from
> >>> consulting work); I collected and tagged 2000 tweets for this first
> >>> iteration. Code and write-ups here:
> >>> https://github.com/ianozsvald/social_media_brand_disambiguator
> >>> http://ianozsvald.com/category/socialmediabranddisambiguator/
> >>>
> >>> A very related question to my project was asked on StackOverflow; you
> >>> might find the advice given by others (I also wrote a long answer)
> >>> relevant:
> >>>
> >>> http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141
> >>>
> >>> With my tweets the messages are roughly of uniform length (so TF-IDF
> >>> isn't so useful) and often have only 1 mention of a term (so binary
> >>> term counts make more sense). Bernoulli Naive Bayes outperforms
> >>> everything else (Logistic Regression comes second). Currently I don't
> >>> do any cleaning or stemming; unigrams are 'ok' but 1-3 n-grams are
> >>> best.
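> >>>
> >>> In scikit-learn terms that's roughly the following pipeline (a
> >>> simplified sketch rather than my actual code; train_tweets and
> >>> train_labels are placeholder names):
> >>>
> >>>     from sklearn.feature_extraction.text import CountVectorizer
> >>>     from sklearn.naive_bayes import BernoulliNB
> >>>     from sklearn.pipeline import Pipeline
> >>>
> >>>     # binary term presence (not TF-IDF), 1-3 grams, no cleaning/stemming
> >>>     clf = Pipeline([
> >>>         ('vec', CountVectorizer(binary=True, ngram_range=(1, 3))),
> >>>         ('nb', BernoulliNB()),
> >>>     ])
> >>>     clf.fit(train_tweets, train_labels)  # train_tweets: list of strings
> >>>     predicted = clf.predict(test_tweets)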
> >>>
> >>> I haven't yet experimented with chi-squared feature selection (beyond
> >>> just eyeballing the output to see what was important). I'm not yet
> >>> adding any external information (e.g. no part-of-speech tags, no
> >>> retrieval of the titles behind URLs in a tweet).
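> >>>
> >>> When I do try chi-squared selection it would presumably just slot in
> >>> between the vectorizer and the classifier, something like this
> >>> (untested sketch, k chosen arbitrarily):
> >>>
> >>>     from sklearn.feature_extraction.text import CountVectorizer
> >>>     from sklearn.feature_selection import SelectKBest, chi2
> >>>     from sklearn.naive_bayes import BernoulliNB
> >>>     from sklearn.pipeline import Pipeline
> >>>
> >>>     clf = Pipeline([
> >>>         ('vec', CountVectorizer(binary=True, ngram_range=(1, 3))),
> >>>         ('chi2', SelectKBest(chi2, k=5000)),  # keep top 5000 n-grams
> >>>         ('nb', BernoulliNB()),
> >>>     ])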
> >>>
> >>> Using 5-fold cross validation I get pretty stable results. I'm
> >>> optimising for precision first and then recall (I want correct
> >>> answers, and as many of them as possible). I test using a precision
> >>> score and cross-entropy error (the latter is useful for fine tweaks
> >>> but isn't comparable between classifiers - the probability
> >>> distributions can be very different). The result is competitive with
> >>> OpenCalais (a commercial alternative) on my training sets using a
> >>> held-out validation set; I don't yet know if it generalises well to
> >>> other time periods (that's next on my list to test).
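> >>>
> >>> The evaluation step, in outline (again a sketch; it assumes binary
> >>> 0/1 labels and the clf pipeline from above):
> >>>
> >>>     from sklearn.model_selection import cross_val_score, train_test_split
> >>>     from sklearn.metrics import log_loss
> >>>
> >>>     # 5-fold cross validation, scored on precision
> >>>     scores = cross_val_score(clf, tweets, labels, cv=5, scoring='precision')
> >>>     print("precision: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))
> >>>
> >>>     # cross-entropy error on a held-out set (fine-tuning only - not
> >>>     # comparable between different classifiers)
> >>>     X_train, X_held, y_train, y_held = train_test_split(tweets, labels,
> >>>                                                         test_size=0.25)
> >>>     clf.fit(X_train, y_train)
> >>>     print("log loss: %0.3f" % log_loss(y_held, clf.predict_proba(X_held)))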
> >>>
> >>> For diagnostics I've found a plain DecisionTree to be useful - it is
> >>> obvious in my trees that two features are being overfit. First, if
> >>> 'http' appears then the tweet is much more likely to be about Apple
> >>> Inc - this could be an artefact of the tweets in my sample period
> >>> (lots of news stories). Secondly, there are Vatican/Pope references
> >>> that give evidence for Apple Inc - these tweets came from around the
> >>> time of the new Pope announcement (e.g. jokes about "iPope" and
> >>> "white smoke from Apple HQ"). These features pop out near the top of
> >>> the Decision Tree and clearly look wrong. This might help with your
> >>> diagnostics?
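> >>>
> >>> The diagnostic itself is nothing fancy - a shallow tree over the same
> >>> binary features, written out with the real term names so suspicious
> >>> splits like 'http' stand out (sketch only):
> >>>
> >>>     from sklearn.feature_extraction.text import CountVectorizer
> >>>     from sklearn.tree import DecisionTreeClassifier, export_graphviz
> >>>
> >>>     vec = CountVectorizer(binary=True, ngram_range=(1, 3))
> >>>     X = vec.fit_transform(train_tweets)
> >>>
> >>>     dt = DecisionTreeClassifier(max_depth=5)  # shallow, so it stays readable
> >>>     dt.fit(X, train_labels)
> >>>
> >>>     # inspect the top splits by name (render tree.dot with graphviz)
> >>>     export_graphviz(dt, out_file='tree.dot',
> >>>                     feature_names=vec.get_feature_names_out())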
> >>>
> >>> I'm also diagnosing the tweets that are misclassified - I invert them
> >>> through the Vectorizer to get the bag of words back. Often the
> >>> misclassified tweets contain only a few terms, and those terms aren't
> >>> very informative. This also might help with diagnosing your poor
> >>> performance?
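> >>>
> >>> The inversion step is roughly this (assuming a vectorizer and
> >>> classifier fitted as above; the names are placeholders):
> >>>
> >>>     import numpy as np
> >>>     from sklearn.feature_extraction.text import CountVectorizer
> >>>     from sklearn.naive_bayes import BernoulliNB
> >>>
> >>>     vec = CountVectorizer(binary=True, ngram_range=(1, 3))
> >>>     X_train = vec.fit_transform(train_tweets)
> >>>     X_test = vec.transform(test_tweets)
> >>>
> >>>     nb = BernoulliNB().fit(X_train, train_labels)
> >>>     predicted = nb.predict(X_test)
> >>>
> >>>     # invert the misclassified rows back to their bag-of-words terms
> >>>     wrong = np.where(predicted != np.asarray(test_labels))[0]
> >>>     for terms in vec.inverse_transform(X_test[wrong]):
> >>>         print(sorted(terms))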
> >>>
> >>> I'd be happy to hear more about your problem as it has obvious
> >>> similarities to mine,
> >>> Ian.
> >>>
> >>> On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
> >>> > I have been using scikit-learn's text classification for several
> >>> > weeks, and I really like it. I use my own corpus (self-generated)
> >>> > and prepare each document using NLTK. Presently I am relying on
> >>> > this tutorial/code-base, only making changes when absolutely
> >>> > necessary for my documents to work.
> >>> >
> >>> > The problem: when working with smaller document sets (tens or a
> >>> > hundred documents instead of hundreds or thousands) I am getting
> >>> > really low success rates (the f1-score).
> >>> >
> >>> > Question: What are the areas/options I should be researching and
> >>> > attempting to implement to improve my categorization success rates?
> >>> >
> >>> > Additional information: using NLTK, I am removing stopwords and
> >>> > stemming each word before the document is classified. With
> >>> > scikit-learn, I am using an n-gram range of one to four (this made
> >>> > a significant difference when I had larger sample sets, but very
> >>> > little difference now that I am working on "sub-categories").
> >>> >
> >>> > Any help would be enormously appreciated. I am willing to research
> >>> > and try any technique recommended, so if you have one, please don't
> >>> > hesitate. Thank you very much.
> >>> >
> >>> > Warmest regards,
> >>> >
> >>> > Mike
> >>> >
> >>>
> >>> --
> >>> Ian Ozsvald (A.I. researcher)
> >>> i...@ianozsvald.com
> >>>
> >>> http://IanOzsvald.com
> >>> http://MorConsulting.com/
> >>> http://Annotate.IO
> >>> http://SocialTiesApp.com/
> >>> http://TheScreencastingHandbook.com
> >>> http://FivePoundApp.com/
> >>> http://twitter.com/IanOzsvald
> >>> http://ShowMeDo.com
> >>
> >
>
>
>
> --
> Ian Ozsvald (A.I. researcher)
> i...@ianozsvald.com
>
> http://IanOzsvald.com
> http://MorConsulting.com/
> http://Annotate.IO
> http://SocialTiesApp.com/
> http://TheScreencastingHandbook.com
> http://FivePoundApp.com/
> http://twitter.com/IanOzsvald
> http://ShowMeDo.com
>
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general