Hi,
I am trying to use TF-IDF weighting to extract significant keywords from
a corpus of texts (and later to compute cosine similarity between texts).
For testing purposes, I am not doing any stopword filtering prior to
vectorizing my data. I am consistently getting unexpected results, with
the stopwords getting the highest TFIDF weights.
I realized that my stopwords were getting an IDF weight of 1, whereas,
as I understand it, they should get something very close to 0, since they
are for the most part present in every single document. Is the IDF
formula used by the TfidfTransformer correct?
# avoid division by zeros for features that occur in all documents
self.idf_ = np.log(float(n_samples) / df) + 1.0
Shouldn't it be as follows, with the smoothing applied inside the
logarithm's argument, so as to avoid a potential zero-division error?
self.idf_ = np.log(float(n_samples) / df + 1.0)
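To make the difference concrete, here is a small sketch (the 4-document
corpus size is a made-up example) comparing the two formulas for a term
that occurs in every document:

```python
import numpy as np

n_samples = 4          # hypothetical toy corpus of 4 documents
df = np.array([4.0])   # document frequency: term appears in all 4

# Current formula: log(n/df) + 1 -> log(1) + 1 = 1.0 for an
# everywhere-present term (e.g. a stopword)
idf_current = np.log(float(n_samples) / df) + 1.0

# Proposed variant: log(n/df + 1) -> log(2) ~= 0.69 for the same term,
# and it never takes the log of zero
idf_proposed = np.log(float(n_samples) / df + 1.0)

print(idf_current)   # [1.0]
print(idf_proposed)  # [~0.693]
```

So under the current formula a ubiquitous term keeps a weight of 1
rather than dropping toward 0.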
Thanks in advance for your help,
Frédérique
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general