2015-05-22 8:29 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
> The default equation is:
> # idf = log ( number_of_docs / number_of_docs_where_term_appears )
>
> And in the online documentation at
> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
> I found the additional info:
>> smooth_idf : boolean, default=True
>> Smooth idf weights by adding one to document frequencies, as if an extra 
>> document was seen containing every term in the collection exactly once. 
>> Prevents zero divisions.
>
>
> So I assume that smooth_idf is calculated as follows:
> # smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )

I don't have a full answer ready, but note that number_of_docs must
also be incremented by the smoothing term (which is actually a
misnomer, IIRC). Otherwise the logs can come out negative: for a term
that appears in every document, the ratio N / (1 + N) is below 1.

Logs are also always natural logs in scikit-learn.
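
To make the point about the numerator concrete, here is a minimal sketch
in plain Python 3 comparing the three variants discussed in this thread.
The function names and the toy counts are made up for illustration, and
this is not meant to reproduce TfidfTransformer's exact output (the
library may apply further adjustments on top of the ratio).

```python
import math

# All three functions use the natural log, as scikit-learn does.

def idf_plain(n_docs, df):
    # Unsmoothed: log(N / df). Fails with ZeroDivisionError if df == 0.
    return math.log(n_docs / df)

def idf_denominator_only(n_docs, df):
    # The formula assumed in the question: only df is incremented.
    return math.log(n_docs / (1 + df))

def idf_smoothed(n_docs, df):
    # Both counts incremented, as if one extra document contained every term.
    return math.log((1 + n_docs) / (1 + df))

# Hypothetical toy corpus: 4 documents, terms appearing in 1, 2, or all 4.
n_docs = 4
for df in (1, 2, 4):
    print(df,
          idf_plain(n_docs, df),
          idf_denominator_only(n_docs, df),
          idf_smoothed(n_docs, df))
```

For the term that appears in all 4 documents, the denominator-only
variant gives log(4/5), which is negative, while incrementing both
counts gives log(5/5) = 0.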

HTH
