I have been using scikit-learn's text classification for several weeks, and I
really like it. I use my own (self-generated) corpus and preprocess each
document with NLTK. At present I am relying on this tutorial/code-base,
changing it only where my documents require it.
The problem: when working with smaller document sets (tens to about a hundred
documents, rather than hundreds or thousands), I am getting very low
f1-scores.
Question: which areas or options should I research and try in order to improve
my classification results?
Additional information: using NLTK, I remove stopwords and stem each word
before the document is classified. In scikit-learn, I use an n-gram range of
one to four (this made a significant difference when I had larger sample sets,
but makes very little difference now that I am working on "sub-categories").
Any help would be enormously appreciated. I am willing to research and try any
technique recommended, so if you have one, please don't hesitate to suggest
it. Thank you very much.
Warmest regards,
Mike
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general