Alright! By removing the +1 the results seem much more legit.

Also, the sublinear transformation makes sense. However, why use min_df=2 if I 
am worried about very common words?

-----Original Message-----
From: Lars Buitinck [mailto:larsm...@gmail.com] 
Sent: Friday, 29 November 2013 14:08


> I think this is done in the TfidfTransformer in this line:
>
>     # avoid division by zeros for features that occur in all documents
>     idf = np.log(float(n_samples) / df) + 1.0
>
> Why is this specifically done? I thought the division by zero is 
> already covered by the smoothing. There seems to be no additional 
> division necessary from my understanding, because finally you only calculate 
> tf * idf.

I think this is a workaround for a bug in a previous iteration of tfidf. You 
can try turning it off and maybe we should turn it off in master, or replace it 
with log(n_samples / (df + 1.)).
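
To make the difference between these variants concrete, here is a rough sketch in plain Python (no scikit-learn involved). The function name idf_variants and the example numbers (100 documents, df of 100 and 2) are made up for illustration; only the three formulas come from this thread.

```python
import math

def idf_variants(n_samples, df):
    """Return the three IDF variants discussed in this thread for one term.

    n_samples: total number of documents
    df: number of documents that contain the term (must be >= 1 here)
    """
    ratio = float(n_samples) / df
    return {
        # the line quoted from TfidfTransformer
        "log_plus_one": math.log(ratio) + 1.0,
        # the same line with the +1 turned off
        "plain_log": math.log(ratio),
        # the suggested replacement, log(n_samples / (df + 1.))
        "df_plus_one": math.log(float(n_samples) / (df + 1.0)),
    }

# Hypothetical corpus of 100 documents.
everywhere = idf_variants(100, df=100)  # term occurs in every document
rare = idf_variants(100, df=2)          # term occurs in only two documents

for name in ("log_plus_one", "plain_log", "df_plus_one"):
    print(name, round(everywhere[name], 3), round(rare[name], 3))
```

With the +1, a term that occurs in every document still gets weight 1.0; without it the weight drops to exactly 0, so tf * idf removes that term entirely. The df + 1 variant goes slightly negative for such a term, which is worth keeping in mind before adopting it.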

Anyway, if you're worried about very common words, try setting min_df=2, and if 
you have a few long documents, try sublinear_tf=True.
That replaces tf with 1 + log(tf), so repeated occurrences of a word are
dampened instead of being counted linearly.
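
A small sketch of what that scaling does to raw counts, again in plain Python. The helper name sublinear_tf and the choice to map a count of zero to 0.0 are assumptions for this example, not a copy of the library code.

```python
import math

def sublinear_tf(tf):
    """Scale a raw term count as 1 + log(tf); absent terms (tf == 0) stay at 0."""
    if tf <= 0:
        return 0.0
    return 1.0 + math.log(tf)

# A word seen 100 times ends up weighing roughly 5.6 times as much as a
# word seen once, rather than 100 times as much.
for tf in (1, 2, 10, 100):
    print(tf, round(sublinear_tf(tf), 3))
```

This only changes the tf part of tf * idf. Filtering by document frequency is a separate step: in scikit-learn's vectorizers, min_df drops terms that occur in too few documents, while max_df is the cutoff for terms that occur in too many.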

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

