2012/7/4 Frédérique Passot <[email protected]>:
> Hi,
>
> I am trying to use TFIDF weighting to extract significant keywords from
> a corpus of texts (and later to compute cosine similarity between texts).
>
> For testing purposes, I am not doing any stopword filtering prior to
> vectorizing my data. I am consistently getting unexpected results, with
> the stopwords getting the highest TFIDF weights.
>
> I realized that my stopwords were getting an IDF weight of 1, where, as
> I understand, they should get something very close to 0 since they are
> for the most part present in every single document. Is the IDF formula
> used by the TfidfTransformer correct?

Indeed, this is not the canonical formula, but for some reason it makes
the document clustering example work better. Understanding why is still
on my TODO list...

Maybe this is because clustering TF-IDF vectors should be done with
cosine similarity instead of the Euclidean distance currently used.
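
One relation that may help when testing this hypothesis: for rows scaled
to unit L2 norm, squared Euclidean distance is a monotone function of
cosine similarity, so the two can only rank neighbours differently on
unnormalized vectors. A minimal sketch on random data (X is a hypothetical
stand-in for TF-IDF rows; whether the clustering example normalizes its
rows is not stated here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 8))  # hypothetical stand-in for TF-IDF rows

# Scale every row to unit L2 norm.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Pairwise cosine similarity and pairwise squared Euclidean distance.
cos_sim = X_unit @ X_unit.T
diff = X_unit[:, None, :] - X_unit[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# For unit-length rows: ||a - b||^2 = 2 - 2 * cos(a, b)
max_gap = np.abs(sq_dist - (2.0 - 2.0 * cos_sim)).max()
print("largest deviation from the identity:", max_gap)
```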

> # avoid division by zeros for features that occur in all documents
>              self.idf_ = np.log(float(n_samples) / df) + 1.0
>
> Shouldn't it be as follows, with the smoothing occurring at the
> denominator level, to avoid a potential zero division error?
>  >>> self.idf_ = np.log(float(n_samples) / df + 1.0)

The comment is wrong. n_samples has already received a smoothing
increment on the previous line. The 1.0 is there to keep idf_ from
being exactly zero for features that are active in every sample. A zero
weight would silently remove those features from the input space, which
can be a hard-to-debug artifact for the user: try removing the +1 and
have a look at the test that breaks. The breakage of that test is very
counter-intuitive when it happens, IMHO.
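
To make the difference concrete, here is a minimal sketch comparing the
three variants discussed in this thread. The numbers are hypothetical toy
values, and the smoothing increment from the previous line is left out for
brevity, so this is an illustration rather than the actual TfidfTransformer
code:

```python
import numpy as np

# Hypothetical toy corpus: 4 documents, 3 terms.
n_samples = 4
# Document frequencies; the first term occurs in every document.
df = np.array([4.0, 2.0, 1.0])

# Textbook IDF: exactly 0 for a term present in every document,
# so that term's column vanishes after weighting.
idf_plain = np.log(n_samples / df)

# Formula quoted above: the +1 sits outside the log, so the weight
# never drops below 1 and no feature is zeroed out.
idf_plus_one = np.log(n_samples / df) + 1.0

# Variant proposed in the question: the +1 sits inside the log.
idf_inside = np.log(n_samples / df + 1.0)

for name, idf in [("plain", idf_plain),
                  ("plus one", idf_plus_one),
                  ("inside log", idf_inside)]:
    print(name, idf)
```

All three variants keep the same ordering of terms; they differ only in
the weight given to a term that occurs in every document.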

We should still find a way to implement more standard IDF smoothing
that does not have this artifact.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
