2013/11/29 Olivier Grisel <olivier.gri...@ensta.org>:
> 2013/11/29 Andreas Hjortgaard Danielsen <andrea...@gmail.com>:
>> Hi,
>>
>> It might be worth noting that Lucene uses the same implementation:
>> http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
> Same as what? The current master or @larsmans' suggested fix?

Actually it seems to be adding one in both places, as Andreas already
said. But then Lucene plays more tricks, such as squaring idf.

I suspect Lucene does this to handle the corner case where the query
contains only words that occur in all documents (and maybe a few that
don't occur anywhere). If it didn't, the document vectors would be all
zero and so would cosine similarity, meaning an empty result set despite
query-document term overlap.

> Honestly I don't remember well how we ended up in the current
> implementation. I just remember that we had introduced bugs at some
> points (negative values and zero division error). The current state
> might still be buggy in some respect as the last bugfix change might
> not be the "correct" way to do it.

There is no "correct" way to do tfidf. It's a hack, with little
theoretical background and with no universally accepted definition. The
textbook idf = log(n_samples / df) can be regarded as a measure of the
information in a term (a term that occurs in all documents carries zero
information and has zero idf) but multiplying that with tf is purely a
heuristic, AFAIK.

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
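
To make the corner case above concrete, here is a minimal sketch in plain
Python. It is not the scikit-learn or Lucene code. The function names, the
toy corpus and the exact form of the "plus one" variant (one added to df
inside the log, one added to the log itself) are assumptions chosen for
illustration; that form is only one reading of "adding one in both places".

```python
import math


def textbook_idf(n_samples, df):
    # Textbook definition from the message: idf = log(n_samples / df).
    return math.log(n_samples / df)


def plus_one_idf(n_samples, df):
    # Hypothetical smoothed variant (an assumption, not a confirmed formula):
    # add one to df inside the log and one to the log itself.
    return 1.0 + math.log(n_samples / (df + 1))


def tfidf(term_counts, doc_freqs, n_samples, idf):
    # Weight each raw term count by the chosen idf function.
    return {t: tf * idf(n_samples, doc_freqs[t])
            for t, tf in term_counts.items()}


def cosine(u, v):
    # Cosine similarity of two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


# Toy corpus of 3 documents: "the" occurs in all of them, "cat" in one.
n_samples = 3
doc_freqs = {"the": 3, "cat": 1}
doc = {"the": 2, "cat": 1}
query = {"the": 1}  # the query contains only a word found in every document

sim_textbook = cosine(tfidf(query, doc_freqs, n_samples, textbook_idf),
                      tfidf(doc, doc_freqs, n_samples, textbook_idf))
sim_plus_one = cosine(tfidf(query, doc_freqs, n_samples, plus_one_idf),
                      tfidf(doc, doc_freqs, n_samples, plus_one_idf))

print(sim_textbook)  # 0.0: the query vector is all zero
print(sim_plus_one)  # positive: the shared term still contributes
```

With the textbook idf the query "the" gets weight zero, so the cosine
similarity is zero even though the document contains the word. With the
smoothed variant the weight stays positive and the document is still
retrieved.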