2013/11/29 Philipp Singer <[email protected]>:
> Nevertheless, when I look up the top tfidf terms for each document, such
> high-frequency terms are at the top of the list even though they occur in
> every single document. I took a deeper look at the specific values, and it
> appears that all these terms – which occur in _every_ document – receive idf
> values of 1. However, shouldn't these be zero? Because if they are one, the
> extremely high term frequency (tf) counts overrule the effect that idf should
> provide, and rank these terms at the top.

Yes, they should be zero if they really occur in all documents.

> I think this is done in the TfidfTransformer in this line:
>
> # avoid division by zeros for features that occur in all documents
>
> idf = np.log(float(n_samples) / df) + 1.0
>
> Why is this done? I thought division by zero was already covered by the
> smoothing. From my understanding no additional adjustment is necessary,
> because in the end you only calculate tf * idf.

I think this is a workaround for a bug in a previous iteration of the
tfidf code. You can try turning it off, and maybe we should turn it off
in master, or replace it with log(n_samples / (df + 1.)).

Anyway, if you're worried about very common words, try setting
min_df=2, and if you have a few long documents, try sublinear_tf=True.
That replaces tf with 1 + log(tf), so repeated occurrences of a word
are dampened.
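A quick sketch of the sublinear scaling itself (plain Python, mirroring what sublinear_tf does to the raw counts):

```python
import math

def sublinear_tf(tf):
    # Sublinear tf scaling: 1 + log(tf) for tf > 0, else 0.
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# A word repeated 100 times counts only about 5.6x as much as a word
# appearing once, instead of 100x:
print(sublinear_tf(1))    # 1.0
print(sublinear_tf(100))  # ~5.605
```

In scikit-learn these options are passed directly to the vectorizer, e.g. TfidfVectorizer(min_df=2, sublinear_tf=True).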

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general