> Anyway, if you're worried about very rare words, try setting
> min_df=2, and if you have a few long documents, try sublinear_tf=True.
> That replaces tf with 1 + log(tf), so repeated occurrences of a word
> within a document contribute less.
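
To make that concrete, here is a minimal sketch of those two options on a
made-up toy corpus (the corpus is just for illustration, and the
`get_feature_names_out` call assumes a recent scikit-learn):

```python
# min_df=2 drops terms that appear in fewer than 2 documents, and
# sublinear_tf=True replaces each non-zero tf with 1 + log(tf) before
# the idf weighting is applied. The corpus below is made up.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the cat ate the mouse",
    "a dog chased the cat and the mouse",
]

vec = TfidfVectorizer(min_df=2, sublinear_tf=True)
X = vec.fit_transform(docs)

print(vec.get_feature_names_out())  # only terms seen in >= 2 documents survive
print(X.toarray().round(3))
```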

To trim words that appear in more than 90% of the documents,
`max_df=0.9` works great too.
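
For example (again a made-up corpus, just to show the behaviour):

```python
# max_df=0.9 ignores any term whose document frequency is strictly higher
# than 90% of the corpus; here "the" appears in all 10 documents and is
# therefore dropped from the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat", "the dog ran", "the bird flew", "the fish swam",
    "the cow slept", "the fox hid", "the owl watched", "the bee flew",
    "the ant crawled", "the rat ran",
]

vec = TfidfVectorizer(max_df=0.9)
vec.fit(docs)
print("the" in vec.vocabulary_)  # False: document frequency 100% > 90%
```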

But if you want to further investigate and fix the TF-IDF weighting,
please feel free to open a PR. Please do check the impact on clustering,
though (for instance by running
http://scikit-learn.org/stable/auto_examples/document_clustering.html
several times). Last time I tried to implement the "correct" TF-IDF
formula, it significantly decreased the stability of the text clustering
example without improving the performance on supervised classification
tasks (e.g. as in
http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
).
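
For reference, here is a rough sketch of the kind of stability check I
mean (not the example script itself; the categories, vectorizer settings
and the sublinear_tf comparison below are only illustrative):

```python
# Cluster a 20 newsgroups subset with k-means over several random seeds
# and look at the spread of the V-measure for each TF-IDF variant; a
# larger spread suggests a less stable clustering under that weighting.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import v_measure_score

categories = ["alt.atheism", "comp.graphics", "sci.space",
              "talk.religion.misc"]
data = fetch_20newsgroups(subset="train", categories=categories,
                          remove=("headers", "footers", "quotes"))

for sublinear in (False, True):
    vec = TfidfVectorizer(max_df=0.5, min_df=2, stop_words="english",
                          sublinear_tf=sublinear)
    X = vec.fit_transform(data.data)
    scores = []
    for seed in range(10):
        labels = KMeans(n_clusters=len(categories), n_init=1,
                        random_state=seed).fit_predict(X)
        scores.append(v_measure_score(data.target, labels))
    print("sublinear_tf=%s: V-measure %.3f +/- %.3f"
          % (sublinear, np.mean(scores), np.std(scores)))
```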

-- 
Olivier
