2013/11/29 Philipp Singer <[email protected]>:
> Nevertheless, when I look up the top tfidf terms for each document, such
> high-frequency terms are at the top of the list even though they occur in
> every single document. I took a deeper look into the specific values, and
> it appears that all these terms – which occur in _every_ document – receive
> idf values of 1. However, shouldn't these be zero? Because if they are one,
> the extremely high term frequency (tf) counts overrule the effect that idf
> should provide, and rank them to the top.
Yes, they should be zero if they really occur in all documents.

> I think this is done in the TfidfTransformer in this line:
>
>     # avoid division by zeros for features that occur in all documents
>     idf = np.log(float(n_samples) / df) + 1.0
>
> Why is this specifically done? I thought the division by zero was already
> covered by the smoothing. From my understanding there seems to be no
> additional division necessary, because in the end you only calculate
> tf * idf.

I think this is a workaround for a bug in a previous iteration of tf-idf.
You can try turning it off, and maybe we should turn it off in master, or
replace it with log(n_samples / (df + 1.)).

Anyway, if you're worried about very common words, try setting min_df=2,
and if you have a few long documents, try sublinear_tf=True. That replaces
tf with 1 + log(tf), so repeated occurrences of a word get penalized.
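To make the numbers concrete, here is a small sketch of the two idf variants
and of the sublinear tf scaling discussed above. This is plain NumPy, not
scikit-learn's actual code path, and the document frequencies are made up:

    import numpy as np

    n_samples = 4                  # hypothetical corpus size
    df = np.array([4., 3., 1.])    # term 0 occurs in every document

    # formula quoted from TfidfTransformer above
    idf_plus_one = np.log(n_samples / df) + 1.0
    # alternative suggested in this thread
    idf_alt = np.log(n_samples / (df + 1.0))

    print(idf_plus_one)  # [1.    1.29  2.39] -> a term in every doc keeps idf = 1
    print(idf_alt)       # [-0.22 0.    0.69] -> the same term is damped instead

    # sublinear_tf=True replaces raw counts tf with 1 + log(tf)
    tf = np.array([50., 5., 1.])
    print(1.0 + np.log(tf))  # [4.91 2.61 1.] -> repeated occurrences grow slowly

With idf fixed at 1 for terms that appear everywhere, a large raw tf still
dominates the ranking, which is exactly the behaviour Philipp observed.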
