On 29 November 2013 14:43, Olivier Grisel <olivier.gri...@ensta.org> wrote:

> 2013/11/29 Andreas Hjortgaard Danielsen <andrea...@gmail.com>:
> > Hi,
> >
> > It might be worth noting that Lucene uses the same implementation:
> >
> > http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
>
> Same as what? The current master or @larsmans' suggested fix?
>

Lucene uses the same implementation as the current master (according to
their documentation):

idf = 1 + log(N / (df + 1))

So this could indicate that there is a reason for the added one.
In the case where df = N (the term shows up in every document), the
log term on its own would become negative because of the df + 1 in the
denominator:

log(N / (N+1)) < 0

which would give negative TF-IDF values. I think the leading "1 +" is
there to avoid that.
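
A quick way to check this numerically is the small sketch below. It only
evaluates the formula quoted above from the Lucene documentation; the
function names and the corpus size of 1000 are made up for illustration
and are not taken from Lucene or scikit-learn.

```python
import math


def log_term(n_docs, df):
    """Log part only: log(N / (df + 1)). Hypothetical helper name."""
    return math.log(n_docs / (df + 1))


def idf_with_offset(n_docs, df):
    """Formula as quoted above: idf = 1 + log(N / (df + 1))."""
    return 1.0 + log_term(n_docs, df)


# Assumed corpus size, chosen only for the illustration.
n_docs = 1000

# The last value, df == n_docs, is a term that occurs in every document.
for df in (1, 10, 999, 1000):
    print(df, log_term(n_docs, df), idf_with_offset(n_docs, df))
```

For df = N the log term alone is slightly below zero, while the value
with the leading 1 stays positive, which is consistent with the guess
above.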


Best,
Andreas



>
> > And Gensim has an option for choosing an addition constant (although the
> > default is 0).
> >
> > https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py
> >
> > Could this be some numerical trick?
>
> Honestly, I don't remember well how we ended up with the current
> implementation. I just remember that we introduced bugs at some
> point (negative values and a zero division error). The current state
> might still be buggy in some respects, as the last bugfix change might
> not be the "correct" way to do it.
>
> --
> Olivier
>
>
> ------------------------------------------------------------------------------
> Rapidly troubleshoot problems before they affect your business. Most IT
> organizations don't have a clear picture of how application performance
> affects their revenue. With AppDynamics, you get 100% visibility into your
> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
> Pro!
> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>