> Anyway, if you're worried about very rare words, try setting min_df=2, and if you have a few long documents, try sublinear_tf=True. That replaces tf with 1 + log(tf), so repeated occurrences of a word within a document are dampened rather than counted linearly.
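For reference, a minimal sketch of how those two options combine on `TfidfVectorizer` (the toy corpus and the printouts are purely illustrative, not from the original message):

```python
# Minimal sketch of the options quoted above, assuming the current
# sklearn.feature_extraction.text.TfidfVectorizer API.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the cat sat on the cat and the cat",
    "dogs chase cats",
]

# min_df=2: ignore terms that appear in fewer than 2 documents.
# sublinear_tf=True: use 1 + log(tf) instead of the raw term count,
# so the third "cat" in a document adds much less than the first one.
vectorizer = TfidfVectorizer(min_df=2, sublinear_tf=True)
X = vectorizer.fit_transform(docs)

print(sorted(vectorizer.vocabulary_))   # terms kept after the min_df cut
print(X.shape)                          # (n_documents, n_kept_terms)
```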
To trim words that appear in more than 90% of the documents, `max_df=0.9` works well too.

If you want to investigate the TF-IDF weighting further and fix it, please feel free to open a PR, but then please check the impact on clustering (for instance by running http://scikit-learn.org/stable/auto_examples/document_clustering.html several times; see the sketch below). Last time I tried to implement the "correct" TF-IDF formula, it significantly decreased the stability of the text clustering example without improving the performance on supervised classification tasks (e.g. as in http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html ).

-- Olivier
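A rough sketch of the kind of stability check suggested above: vectorize a small 20 newsgroups subset, cluster it several times with different random seeds, and compare the partitions pairwise. The dataset, the choice of k-means, and the adjusted Rand index metric are assumptions for illustration; this is much simpler than the full document_clustering example.

```python
# Hypothetical stability check: a weighting change that hurts stability
# should show up as a noticeably lower mean pairwise ARI across runs.
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

categories = ["alt.atheism", "sci.space", "talk.religion.misc"]
data = fetch_20newsgroups(subset="train", categories=categories)

# TF-IDF settings roughly matching the options discussed in this thread.
X = TfidfVectorizer(max_df=0.9, min_df=2, sublinear_tf=True,
                    stop_words="english").fit_transform(data.data)

# Run k-means with several seeds (n_init=1 so each seed gives one run).
labelings = [
    KMeans(n_clusters=len(categories), n_init=1,
           random_state=seed).fit_predict(X)
    for seed in range(5)
]

# Pairwise agreement between runs: close to 1.0 means stable clustering.
scores = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
print("mean pairwise ARI: %.3f" % (sum(scores) / len(scores)))
```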