Alright! With the + 1.0 removed from the idf, the results look much more plausible. The sublinear transformation also makes sense. However, why use min_df=2 if I am worried about very common words?
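For reference, here is a minimal sketch of the quantities discussed in this thread, in plain Python rather than through scikit-learn itself. The function names, the toy document frequencies, and the `keep_term` helper are my own illustrations and not part of the library; the helper only mimics how I understand the documented cutoffs to work (min_df drops terms that occur in too few documents, max_df drops terms that occur in too many), so please treat it as an assumption.

```python
import math


def idf_plus_one(n_samples, df):
    # Variant quoted in the thread: log(n_samples / df) + 1.0
    return math.log(float(n_samples) / df) + 1.0


def idf_plain(n_samples, df):
    # Same expression with the "+ 1.0" removed
    return math.log(float(n_samples) / df)


def sublinear_tf(tf):
    # sublinear_tf as described in the thread: tf -> 1 + log(tf) for tf > 0
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def keep_term(df, n_samples, min_df=1, max_df=1.0):
    # Hypothetical helper: min_df is an absolute document count,
    # max_df a proportion of all documents.
    return min_df <= df <= max_df * n_samples


# Toy corpus statistics (made up for illustration): 4 documents.
n_samples = 4
doc_freqs = {"the": 4, "tfidf": 2, "zebra": 1}

for term, df in doc_freqs.items():
    print(
        term,
        "idf+1=%.3f" % idf_plus_one(n_samples, df),
        "idf=%.3f" % idf_plain(n_samples, df),
        "kept with min_df=2:", keep_term(df, n_samples, min_df=2),
        "kept with max_df=0.75:", keep_term(df, n_samples, max_df=0.75),
    )
```

In this toy setup a term that occurs in every document gets idf 0 once the + 1.0 is gone, so it no longer contributes to tf * idf. The min_df=2 cutoff only removes the rare term, while the max_df cutoff is the one that removes the term present in all documents.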
-----Original Message-----
From: Lars Buitinck [mailto:larsm...@gmail.com]
Sent: Friday, 29 November 2013 14:08

> I think this is done in the TfidfTransformer in this line:
>
>     # avoid division by zeros for features that occur in all documents
>     idf = np.log(float(n_samples) / df) + 1.0
>
> Why is this specifically done? I thought the division by zero is
> already covered by the smoothing. There seems to be no additional
> division necessary from my understanding, because finally you only calculate
> tf * idf.

I think this is a workaround for a bug in a previous iteration of tfidf. You can try turning it off and maybe we should turn it off in master, or replace it with log(n_samples / (df + 1.)).

Anyway, if you're worried about very common words, try setting min_df=2, and if you have a few long documents, try sublinear_tf=True. That replaces tf with 1 + log(tf) so repeated occurrences of a word get penalized.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general