2012/7/4 Frédérique Passot <[email protected]>:
> Hi,
>
> I am trying to use TFIDF weighting to extract significant keywords from
> a corpus of texts (and later to compute cosine similarity between texts).
>
> For testing purposes, I am not doing any stopword filtering prior to
> vectorizing my data. I am consistently getting unexpected results, with
> the stopwords getting the highest TFIDF weights.
>
> I realized that my stopwords were getting an IDF weight of 1, where, as
> I understand, they should get something very close to 0 since they are
> for the most part present in every single document. Is the IDF formula
> used by the TfidfTransformer correct?

Indeed, this is not a canonical formula, but it makes the document
clustering example work better for some reason. It is still on my TODO
list to try to understand why... Maybe this is because clustering on
TF-IDF should be done with cosine similarity instead of the Euclidean
distance, as is currently done.

> # avoid division by zeros for features that occur in all documents
> self.idf_ = np.log(float(n_samples) / df) + 1.0
>
> Shouldn't it be as follows, with the smoothing occurring at the
> denominator level, to avoid a potential zero division error?
>
> >>> self.idf_ = np.log(float(n_samples) / df + 1.0)

The comment is wrong. n_samples has already received a smoothing
increment on the previous line. This 1.0 is there to avoid idf_ being
exactly zero for features that are active on only one sample. A zero
idf_ would remove those features from the input space, and that artifact
can be hard for the user to debug: try removing the +1 and have a look
at the test that breaks. The breakage of that test is very
counter-intuitive when it happens, IMHO.

We should still find a way to implement a more standard IDF smoothing
that does not have this artifact.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
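
For readers following the thread, the three IDF variants under discussion can be compared numerically. This is only a rough sketch with plain NumPy, not the actual TfidfTransformer code: the corpus size, the document frequencies and the variable names `idf_textbook`, `idf_plus_one` and `idf_inside` are made up for illustration, and no smoothing increment is applied to `n_samples` or `df`.

```python
import numpy as np

# Toy corpus statistics (assumed values, for illustration only).
n_samples = 4                    # number of documents
df = np.array([4.0, 2.0, 1.0])   # document frequency of three terms;
                                 # the first one occurs in every document

# Plain log ratio, with no smoothing and no offset.
idf_textbook = np.log(n_samples / df)

# Form quoted in the thread: the 1.0 is added outside the log.
idf_plus_one = np.log(float(n_samples) / df) + 1.0

# Form proposed in the question: the 1.0 is added inside the log.
idf_inside = np.log(float(n_samples) / df + 1.0)

for name, idf in [("log(n/df)", idf_textbook),
                  ("log(n/df) + 1", idf_plus_one),
                  ("log(n/df + 1)", idf_inside)]:
    print(name, idf)
```

In this unsmoothed sketch the plain log ratio is exactly zero for the term that occurs in every document, while the "+ 1.0 outside the log" form gives it a weight of exactly 1, which matches the value reported for stopwords in the question. All three variants rank the terms in the same order; they differ only in how far the weight of very frequent terms is pushed towards zero.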
