Hi,
I am trying to use TF-IDF weighting to extract significant keywords from
a corpus of texts (and later to compute cosine similarity between texts).
For testing purposes, I am not doing any stopword filtering prior to
vectorizing my data. I am consistently getting unexpected results, with
the stopwords getting the highest TFIDF weights.
I realized that my stopwords were getting an IDF weight of 1, whereas,
as I understand it, they should get something very close to 0, since they
are for the most part present in every single document. Is the IDF
formula used by the TfidfTransformer correct?
# avoid division by zeros for features that occur in all documents
self.idf_ = np.log(float(n_samples) / df) + 1.0
Shouldn't it be as follows, with the smoothing applied inside the
logarithm's argument, so as to avoid a potential zero-division error?
self.idf_ = np.log(float(n_samples) / df + 1.0)
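To make the difference concrete, here is a small sketch (the 4-document
corpus size is a made-up example) comparing the two formulas for a term
that occurs in every document:

```python
import numpy as np

n_samples = 4          # hypothetical toy corpus of 4 documents
df = np.array([4.0])   # document frequency: term appears in all 4

# Current formula: log(n/df) + 1 -> log(1) + 1 = 1.0 for an
# everywhere-present term (e.g. a stopword)
idf_current = np.log(float(n_samples) / df) + 1.0

# Proposed variant: log(n/df + 1) -> log(2) ~= 0.69 for the same term,
# and it never takes the log of zero
idf_proposed = np.log(float(n_samples) / df + 1.0)

print(idf_current)   # [1.0]
print(idf_proposed)  # [~0.693]
```

So under the current formula a ubiquitous term keeps a weight of 1
rather than dropping toward 0.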
Thanks in advance for your help,
Frédérique
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general