Hi there,

 

I am currently working with the TfidfVectorizer provided by scikit-learn.
However, I have run into a problem/question. In my case I have around 20
very long documents. Some terms in these documents occur much, much more
frequently than others. Intuitively, these terms should get penalized
heavily (weighted close to zero) by the tf-idf procedure.

 

Nevertheless, when I look up the top tf-idf terms for each document, such
high-frequency terms are at the top of the list even though they occur in
every single document. I took a deeper look at the specific values, and it
appears that all these terms - the ones that occur in _every_ document -
receive idf values of 1. However, shouldn't these be zero? If they are one,
the extremely high term-frequency (tf) counts overrule the damping that idf
is supposed to provide, and rank these terms at the top.
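
Here is a minimal sketch of what I mean (a toy corpus of my own invention,
just for illustration, with the default TfidfVectorizer settings):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy corpus: "common" occurs in every single document
    docs = [
        "common apple apple apple",
        "common banana",
        "common cherry",
    ]

    vec = TfidfVectorizer()  # smooth_idf=True is the default
    vec.fit(docs)

    for term, idx in sorted(vec.vocabulary_.items()):
        print(term, vec.idf_[idx])

    # "common" appears in all 3 documents and its idf comes out as
    # log((1 + 3) / (1 + 3)) + 1 = 1.0 rather than 0.0 - the same
    # behaviour I see on my real data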

 

I think this is done in the TfidfTransformer in these lines:

    # avoid division by zeros for features that occur in all documents
    idf = np.log(float(n_samples) / df) + 1.0

 

Why is this specifically done? I thought the division by zero is already
covered by the smoothing (smooth_idf adds one to the document frequencies),
so from my understanding no additional safeguard should be necessary - in
the end you only calculate tf * idf anyway.
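
To spell the arithmetic out the way I understand it (n_samples = 20 matches
my corpus size, the df values are made up):

    import numpy as np

    n_samples = 20            # my corpus size
    df = np.array([20., 1.])  # a term in every document vs. a rare one

    # smooth_idf=True: act as if one extra document contained every term,
    # so the denominator can never become zero, not even for unseen terms
    smoothed = np.log((n_samples + 1.0) / (df + 1.0))
    print(smoothed)        # [0.   2.35] - ubiquitous term drops to zero

    # what the quoted line computes on top of that
    print(smoothed + 1.0)  # [1.   3.35] - ubiquitous term keeps weight 1

So the smoothing alone already seems to make the division safe, and the
extra +1.0 is exactly what turns the idf of 0 into the idf of 1 that I am
seeing.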

 

Hope someone can help me out.

 

Cheers,

Philipp

 
