Hi Harold. Are you using different models for the different types of social media? I'd guess that the grammar and terms used in a tweet could look quite different from what you'd see in e.g. a Google+ comment (a different demographic probably means higher-quality English, and fewer space restrictions make longer, clearer wording more likely). I'd also guess that Facebook and Google+ text might be similar, that YouTube comments (with their wide demographic and varying quality of English) are specific to reactions to a video, and that tweets are short and so pack a lot of context into few characters.
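If you do split by platform, a minimal sketch of per-source models might look like the following (texts_by_source and labels_by_source are hypothetical names, and the binary 1-3 gram / Bernoulli Naive Bayes choices just mirror what worked for me in the thread quoted below):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    # one independent vectorizer+classifier pair per platform, since the
    # text register differs between Twitter, Facebook, YouTube etc.
    models = {}
    for source, texts in texts_by_source.items():
        pipe = make_pipeline(
            CountVectorizer(binary=True, ngram_range=(1, 3)),
            BernoulliNB())
        pipe.fit(texts, labels_by_source[source])
        models[source] = pipe

    # route each new message through its own platform's model, e.g.:
    # models["twitter"].predict(["loving the new iphone"])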
Re. labelled data - yes, get lots of it. More data probably beats all the other tweaks you could come up with. I have millions of tweets here for 'apple', and I'm about to experiment with building models based on rule-derived classifications (e.g. if 'iphone' appears then is-Apple-Inc; if 'apple juice' appears then is-not-Apple-Inc) - a minimal sketch of this appears at the end of this mail. This, possibly combined with some semi-supervised classification and human intervention, could be a powerful mechanism. The alternative is to employ an intern, Mechanical Turk or CrowdFlower to get cheap human-hours behind the classification task (e.g. using 3 human votes per message to average out mistakes and foolishness). I've looked into this before but haven't yet tried it in anger.
Ian.

On 11 July 2013 17:34, Harold Nguyen <har...@nexgate.com> wrote:
> Hi Ian,
>
> Thank you very much for writing this message, and especially for sharing your experience. I am actually doing the very same thing, and would love to collaborate with you if possible. I'm not as far along in my journey as you are, but I hope we can help each other in the future!
>
> I'm categorizing (>2 categories) tweets, Facebook comments, YouTube comments, Google+ comments, etc.
>
> The biggest problem is not having enough labeled data. As others have mentioned, it seems like a great idea to spend resources on obtaining this labeled data, but I also like the idea of "automatic text classification", performed in a semi-supervised way.
>
> The idea is that, given a small amount of labeled data, one can cluster unlabeled documents to expand the amount of data in each category. I haven't gotten far with this yet, but once I have, I'd like to circle back with you if I learn anything interesting.
>
> Thanks and best wishes,
>
> Harold
>
> On Thu, Jul 11, 2013 at 2:43 AM, Ian Ozsvald <i...@ianozsvald.com> wrote:
>>
>> Hello Mike. Could you give a summary of your problem? It sounds like you're categorising text (tweets? medical text? news articles?) into >2 categories (how many?), is that right? And is the goal really to optimise your f1 score, or do you want accurate categorisations (precision), or perhaps high recall?
>>
>> I'm working on a related problem, "brand disambiguation in social media" - basically I'm looking to disambiguate words like "apple" in tweets for brand analysis: I want all the Apple Inc mentions and none of the mentions of fruit, juice, bands, pejoratives, fashion etc. This is similar to spam classification (in-class vs. out-of-class). The project is open (I've been on this task for 5 weeks as a break from consulting work), and I collected and tagged 2000 tweets for this first iteration. Code and write-ups here:
>> https://github.com/ianozsvald/social_media_brand_disambiguator
>> http://ianozsvald.com/category/socialmediabranddisambiguator/
>>
>> A closely related question was asked on StackOverflow; you might find the advice given by others (I also wrote a long answer) relevant:
>> http://stackoverflow.com/questions/17352469/how-can-i-build-a-model-to-distinguish-tweets-about-apple-inc-from-tweets-abo/17502141
>>
>> With my tweets the messages are of roughly uniform length (so TF-IDF isn't so useful) and often contain only one mention of a term (so binary term counts make more sense). Bernoulli Naive Bayes outperforms everything else (Logistic Regression comes second). Currently I don't do any cleaning or stemming; unigrams are 'ok' but 1-3 ngrams are best (a sketch of this pipeline appears at the end of this mail).
>>
>> I haven't yet experimented with chi2 for feature selection (beyond just eyeballing the output to see what was important). I'm not yet adding any external information (e.g. no part-of-speech tags, no retrieving of the titles of URLs in a tweet).
>>
>> Using 5-fold cross validation I get pretty stable results. I'm optimising for precision first and recall second (I want correct answers, and as many of them as possible). I test using a precision score and cross-entropy error (the latter is useful for fine tweaks but isn't comparable between classifiers - the probability distributions can be very different). The result is competitive with OpenCalais (a commercial alternative) on my training sets using a held-out validation set; I don't yet know if it generalises well to other time periods (that's next on my list to test).
>>
>> For diagnostics I've found a plain DecisionTree to be useful - it is obvious in my trees that two features are being overfit. First, if 'http' appears then the tweet is much more likely to be classed as Apple Inc - this could be an artefact of the tweets in my sample period (lots of news stories). Secondly, there are Vatican/Pope references that give evidence for Apple Inc - these tweets came from around the time of the new Pope's announcement (e.g. jokes about "iPope" and "white smoke from Apple HQ"). These features pop out near the top of the decision tree and clearly look wrong. This might help with your diagnostics?
>>
>> I'm also diagnosing the tweets that are misclassified - I invert them through the Vectorizer to get the bag of words back (a sketch of both diagnostics appears at the end of this mail). Often the misclassified tweets contain only a few terms, and those terms are uninformative. This might also help with diagnosing your poor performance?
>>
>> I'd be happy to hear more about your problem as it has obvious similarities to mine,
>> Ian.
>>
>> On 10 July 2013 22:26, Mike Hansen <mikehanse...@ymail.com> wrote:
>> > I have been using Scikit's text classification for several weeks, and I really like it. I use my own corpus (self-generated) and prepare each document using the NLTK. Presently I am relying on this tutorial/code-base, only making changes when absolutely necessary for my documents to work.
>> >
>> > The problem: when working with smaller document sets (tens or one hundred instead of hundreds or thousands), I am getting really low success rates (as measured by the f1-score).
>> >
>> > Question: what areas/options should I research and attempt to implement to improve my categorization success rates?
>> >
>> > Additional information: using the NLTK, I remove stopwords and stem each word before the document is classified. Using Scikit, I use an n-gram range of one to four (this made a significant difference when I had larger sample sets, but very little difference now that I am working on "sub-categories").
>> >
>> > Any help would be enormously appreciated. I am willing to research and try any technique recommended, so if you have one, please don't hesitate. Thank you very much.
>> >
>> > Warmest regards,
>> >
>> > Mike
--
Ian Ozsvald (A.I. researcher)
i...@ianozsvald.com

http://IanOzsvald.com
http://MorConsulting.com/
http://Annotate.IO
http://SocialTiesApp.com/
http://TheScreencastingHandbook.com
http://FivePoundApp.com/
http://twitter.com/IanOzsvald
http://ShowMeDo.com
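P.S. Here are minimal sketches of the ideas above. First, the rule-derived labelling: hand-written keyword rules assign noisy labels to unlabelled tweets, and messages the rules can't decide are left for the semi-supervised step or human voters. The rules and the unlabelled_tweets variable are illustrative placeholders, not code from my repository:

    def rule_label(tweet):
        """Return 1 for is-Apple-Inc, 0 for is-not-Apple-Inc, None to abstain."""
        t = tweet.lower()
        if "iphone" in t or "ipad" in t or "macbook" in t:
            return 1
        if "apple juice" in t or "apple pie" in t:
            return 0
        return None  # undecided - leave for semi-supervised step / human votes

    # keep only the tweets a rule could decide; these seed the first model
    seed = [(t, rule_label(t)) for t in unlabelled_tweets]
    seed = [(t, y) for t, y in seed if y is not None]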
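Second, the binary-term-count pipeline with 5-fold cross validation scored on precision, as described in the quoted mail. tweets (a list of strings) and labels (1 = Apple Inc, 0 = not) are placeholders, and this uses the current scikit-learn API:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    pipe = make_pipeline(
        CountVectorizer(binary=True, ngram_range=(1, 3)),  # binary presence of 1-3 grams
        BernoulliNB())

    # 5-fold cross validation, optimising what I care about first: precision
    scores = cross_val_score(pipe, tweets, labels, cv=5, scoring="precision")
    print("precision: %.3f +/- %.3f" % (scores.mean(), scores.std()))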
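Third, the two diagnostics: a shallow decision tree to spot features that are being overfit (like 'http' and the Pope jokes), and inverting misclassified tweets through the vectorizer to recover their bag of words. Again, tweets and labels are the same placeholders as above:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.tree import DecisionTreeClassifier

    vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3))
    X = vectorizer.fit_transform(tweets)
    y = np.asarray(labels)

    # 1) shallow tree: suspicious features rise to the top of the importances
    tree = DecisionTreeClassifier(max_depth=5).fit(X, y)
    names = vectorizer.get_feature_names_out()
    top = np.argsort(tree.feature_importances_)[::-1][:10]
    print([names[i] for i in top])

    # 2) invert the misclassified tweets back into their terms
    clf = BernoulliNB().fit(X, y)
    mis = np.where(clf.predict(X) != y)[0]
    for terms in vectorizer.inverse_transform(X[mis]):
        print(sorted(terms))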