I have been using scikit-learn's text classification for several weeks, and I 
really like it.  I use my own (self-generated) corpus and prepare each document 
with NLTK.  At present I am relying on this tutorial/code-base, changing it 
only where absolutely necessary for my documents to work.

The problem: when working with smaller document sets (tens to around one 
hundred documents instead of hundreds or thousands), I get very low success 
rates (measured by the f1-score).

Question: Which areas/options should I research and try to implement in order 
to improve my categorization success rates?

Additional information: using NLTK, I remove stopwords and stem each word 
before the document is classified.  In scikit-learn, I use an n-gram range of 
one to four (this made a significant difference when I had larger sample sets, 
but makes very little difference now that I am working on "sub-categories").

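For concreteness, my setup looks roughly like the sketch below.  This is a 
minimal illustration, not the tutorial's code: the toy documents, the labels, 
the small hand-written STOPWORDS set (a stand-in for NLTK's stopword corpus, so 
the sketch runs without downloading NLTK data), the choice of TfidfVectorizer 
with LinearSVC, and the 3-fold cross-validation are all assumptions made for 
the example.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Assumption: a tiny stand-in stopword list. With the NLTK data installed,
# nltk.corpus.stopwords.words("english") would be used instead.
STOPWORDS = {"the", "a", "an", "is", "in", "of", "and", "to", "my", "on", "for", "was"}

stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, drop stopwords, and stem each remaining token."""
    tokens = text.lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOPWORDS)


# Hypothetical toy corpus: two categories, six documents each.
docs = [
    "the engine of my car is making a loud noise",
    "replacing the brakes and tires on a truck",
    "my car was leaking oil in the garage",
    "the mechanic repaired the engine and the brakes",
    "buying new tires for the car",
    "the truck engine is overheating",
    "baking bread in the oven",
    "a recipe for chicken soup and bread",
    "cooking pasta and baking a cake",
    "the soup is simmering on the stove",
    "my cake recipe is baking in the oven",
    "cooking chicken for a pasta recipe",
]
labels = ["auto"] * 6 + ["food"] * 6

# Word n-grams from one to four, as described above.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 4)), LinearSVC())

# With so few documents a single train/test split is very noisy, so the
# score here is computed from cross-validated predictions.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
cleaned = [preprocess(d) for d in docs]
predicted = cross_val_predict(model, cleaned, labels, cv=cv)
score = f1_score(labels, predicted, average="macro")
print("macro f1-score: %.2f" % score)
```

With the real corpus, docs and labels would come from my own documents, and 
the per-category scores are where the drop shows up on the smaller sets.
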
Any help would be enormously appreciated.  I am willing to research and try any 
technique recommended, so if you have one, please don't hesitate to suggest it.  
Thank you very much.

Warmest regards,

Mike
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
