2013/7/10 Mike Hansen <mikehanse...@ymail.com>:
> I have been using Scikit's text classification for several weeks, and I
> really like it.  I use my own corpus (self-generated) and prepare each
> document using the NLTK.  Presently I am relying on this tutorial/code-base,
> only making changes when absolutely necessary for my documents to work.
>
> The problem: when working with smaller document sets (tens or one-hundred
> instead of hundreds or thousands) I am receiving really low success rates
> (the f1-score).
>
> Question: What are the areas/options I should be researching and attempting
> to implement to improve my categorization success rates?
>
> Additional information: using the NLTK, I am removing stopwords and stemming
> each word before the document is classified.  Using Scikit, I am using an
> n-gram range of one to four (this made a significant difference when I had
> larger sample sets, but very little difference now that I am working on
> "sub-categories").
>
> Any help would be enormously appreciated.  I am willing to research and try
> any technique recommended, so if you have one, please don't hesitate.  Thank
> you very much.

Since you have very few labeled samples, you should reduce the
complexity of the classification model by reducing the number of
potentially noisy features:

- stick to individual tokens instead of n-grams
- try max_df=0.9 or 0.8 and min_df=2 or 3
- use chi2 feature selection to select the top 1000 features or fewer
- try binary features instead of TF-IDF features
- try Multinomial Naive Bayes instead of penalized linear classifiers

Use 10-fold cross-validation to grid search the parameters of the
whole pipeline.
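
The suggestions above could be wired together roughly as follows. This
is only a sketch, assuming a recent scikit-learn where GridSearchCV
lives in sklearn.model_selection. The helper name build_search, the
candidate values for k, and the f1_macro scoring are my own
illustrative choices, not something prescribed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_search(k_values=(100, 500, 1000), n_splits=10):
    """Hypothetical helper: grid search over a small, low-complexity pipeline."""
    pipeline = Pipeline([
        # Individual tokens only (no n-grams), binary presence/absence features.
        ("vect", CountVectorizer(ngram_range=(1, 1), binary=True)),
        # Keep only the k features most associated with the labels (chi2 test).
        ("select", SelectKBest(chi2)),
        # Multinomial Naive Bayes instead of a penalized linear classifier.
        ("clf", MultinomialNB()),
    ])
    param_grid = {
        "vect__max_df": [0.8, 0.9],
        "vect__min_df": [2, 3],
        # Assumption: each k must not exceed the vocabulary size left after
        # max_df/min_df filtering; pass smaller values for tiny corpora.
        "select__k": list(k_values),
    }
    # Stratified folds keep the class proportions similar in every split.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
```

Usage would be along the lines of search = build_search() followed by
search.fit(docs, labels), where docs is a list of raw (or already
stemmed) strings and labels the matching categories. After fitting,
search.best_params_ and search.best_score_ show which combination did
best. Every class needs at least n_splits samples for the stratified
split to work.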

Also, rather than wasting too much time trying to come up with a
complex model, you should invest effort in collecting more labeled
samples, possibly in a semi-supervised way (e.g. run Google queries or
full-text searches for interesting keywords on the web or on a bigger
unlabeled corpus).
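
For the unlabeled-corpus case, a very rough sketch of the keyword
search could look like this. The function name find_candidates, the
keywords_by_label mapping and the min_hits threshold are all
hypothetical. The matches are only candidates and should be reviewed
by hand before being added to the training set.

```python
import re


def find_candidates(unlabeled_docs, keywords_by_label, min_hits=1):
    """Return {label: [document indices]} for documents that mention
    at least min_hits of the keywords listed for that label."""
    candidates = {label: [] for label in keywords_by_label}
    for idx, text in enumerate(unlabeled_docs):
        # Lowercased set of word tokens for cheap whole-word matching.
        tokens = set(re.findall(r"\w+", text.lower()))
        for label, keywords in keywords_by_label.items():
            hits = sum(1 for kw in keywords if kw.lower() in tokens)
            if hits >= min_hits:
                candidates[label].append(idx)
    return candidates
```

Matching is on whole lowercased tokens, so if the training documents
are stemmed it would make sense to stem the keywords and the unlabeled
text the same way before calling it.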

--
Olivier

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
