Hi,
It might be worth noting that Lucene uses the same implementation:
http://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
And Gensim has an option for choosing an additive constant (although the
default is 0):
https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py
Could this be some numerical trick?
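
To make the question concrete, here is a minimal sketch of what an additive
constant does to the textbook idf = log(N / df). The helper `idf` and its
`add` argument are made up for illustration; this is not the scikit-learn,
Lucene or Gensim code, and each library's exact formula differs in detail.
One possible reading is that the constant keeps a term that occurs in every
document from getting a weight of exactly zero:

```python
import math


def idf(doc_freq, n_docs, add=0.0):
    """Textbook idf, log(N / df), plus an optional additive constant.

    Hypothetical helper for illustration only; not any library's API.
    """
    return add + math.log(n_docs / doc_freq)


n_docs = 1000
for doc_freq in (1, 100, 1000):
    plain = idf(doc_freq, n_docs)             # no constant
    shifted = idf(doc_freq, n_docs, add=1.0)  # constant of 1
    print(f"df={doc_freq:4d}  idf={plain:.3f}  idf+1={shifted:.3f}")
```

With add=0 the term that appears in all 1000 documents gets idf = 0 and
drops out of every document vector; with add=1 it keeps a small weight.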
Best,
Andreas
On 29 November 2013 14:27, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> Anyway, if you're worried about very common words, try setting
> min_df=2, and if you have a few long documents, try sublinear_tf=True.
> That replaces tf with 1 + log(tf) so repeated occurrences of a word
> get penalized.
>
> To trim words that occur in more than 90% of the documents,
> `max_df=0.9` works great too.
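
A small sketch of the sublinear scaling mentioned above, assuming the natural
logarithm and a weight of 0 for absent terms. The function name
`sublinear_tf` here is a stand-alone illustration, not the scikit-learn
implementation:

```python
import math


def sublinear_tf(tf):
    """Return 1 + log(tf) for tf > 0, and 0 for a term that does not occur."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


# Raw counts grow 100-fold here, but the scaled weights grow far more slowly.
for tf in (1, 10, 100):
    print(tf, round(sublinear_tf(tf), 3))
```

In scikit-learn the three options from the quoted text would be passed
together, e.g. `TfidfVectorizer(min_df=2, max_df=0.9, sublinear_tf=True)`.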
>
> If you want to investigate and fix the TF-IDF weighting further,
> please feel free to open a PR. Please check the impact on clustering,
> though (for instance by running
> http://scikit-learn.org/stable/auto_examples/document_clustering.html
> several times). Last time I tried to implement the "correct" TF-IDF
> formula, it decreased the stability of the text clustering example
> significantly without improving the performance on supervised
> classification tasks (e.g. as in
> http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
> ).
>
> --
> Olivier
>
>
> ------------------------------------------------------------------------------
> Rapidly troubleshoot problems before they affect your business. Most IT
> organizations don't have a clear picture of how application performance
> affects their revenue. With AppDynamics, you get 100% visibility into your
> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
> Pro!
> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>