2013/11/29 Olivier Grisel <olivier.gri...@ensta.org>:
> 2013/11/29 Andreas Hjortgaard Danielsen <andrea...@gmail.com>:
>> Hi,
>>
>> It might be worth noting that Lucene uses the same implementation:
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
> Same as what? The current master or @larsmans' suggested fix?

Actually it seems to be adding one in both places, as Andreas already
said. But then Lucene plays more tricks, such as squaring idf.

I suspect Lucene does this to handle the corner case where the query
contains only words that occur in all documents (and maybe a few that
don't occur anywhere). If it didn't, the document vectors would be all
zero and so would cosine similarity, meaning an empty result set
despite query-document term overlap.
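
To make the corner case concrete, here is a minimal sketch on a toy corpus where every query term occurs in every document. The function names, the toy counts, and the smoothed form 1 + log(n_docs / (df + 1)) are assumptions for illustration (the smoothed form is my reading of "adding one in both places"), not a copy of Lucene's or scikit-learn's code.

```python
import math


def textbook_idf(n_docs, df):
    # Textbook definition: zero when the term occurs in every document.
    return math.log(n_docs / df)


def smoothed_idf(n_docs, df):
    # Assumed smoothed variant: one added to df and one added to the log.
    return 1.0 + math.log(n_docs / (df + 1.0))


def cosine(u, v):
    # Cosine similarity, defined as 0.0 when either vector is all zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): 3 documents, 2 terms, both in all documents.
n_docs = 3
df = [3, 3]
doc_tf = [2, 1]    # term counts in one document
query_tf = [1, 1]  # term counts in the query


def weighted(tf, idf_fn):
    # tf-idf vector for the given raw counts and idf function.
    return [t * idf_fn(n_docs, d) for t, d in zip(tf, df)]


sim_plain = cosine(weighted(query_tf, textbook_idf), weighted(doc_tf, textbook_idf))
sim_smooth = cosine(weighted(query_tf, smoothed_idf), weighted(doc_tf, smoothed_idf))

print("textbook idf:", sim_plain)
print("smoothed idf:", sim_smooth)
```

With the textbook idf both vectors are all zero, so the document is never matched even though it contains both query terms; with the smoothed idf the similarity is positive.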

> Honestly I don't remember well how we ended up in the current
> implementation. I just remember that we had introduced bugs at some
> points (negative values and zero division error). The current state
> might still be buggy in some respect as the last bugfix change might
> not be the "correct" way to do it.

There is no "correct" way to do tfidf. It's a hack, with little
theoretical background and with no universally accepted definition.
The textbook idf = log(n_samples / df) can be regarded as a measure of
the information in a term (a term that occurs in all documents carries
zero information and has zero idf), but multiplying that by tf is
purely a heuristic, AFAIK.
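
The information reading follows from log(n_samples / df) = -log(df / n_samples), where df / n_samples is the empirical probability that a randomly chosen document contains the term. A small sketch, with hypothetical function names and made-up counts:

```python
import math


def idf(n_samples, df):
    # Textbook idf as quoted above.
    return math.log(n_samples / df)


def self_information(p):
    # Information content (in nats) of an event with probability p.
    return -math.log(p)


n_samples = 1000  # made-up corpus size
for df in (1000, 100, 1):
    p = df / n_samples  # fraction of documents containing the term
    print(df, idf(n_samples, df), self_information(p))
```

The two columns agree up to floating-point rounding, and the row with df == n_samples is zero. Nothing in this reading says how the result should be combined with tf.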

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general