Okay, I got it now and put together a short notebook for personal reference, 
in case anyone is interested:

https://github.com/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb
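
For a quick look without opening the notebook, here is a minimal sketch of the point made in the thread below. It compares the formula I first assumed (only the document frequency incremented) with the variant where both counts are incremented, as Lars suggests. The function names are purely illustrative, and the trailing "+ 1" on the smoothed idf is an assumption based on my reading of TfidfTransformer, so please check it against your installed version:

```python
import math


def idf_naive(n_docs, df):
    # Formula assumed in the original question: only the document
    # frequency is incremented, so the ratio can drop below 1.
    return math.log(n_docs / (1 + df))


def idf_smooth(n_docs, df):
    # Both counts are incremented ("as if an extra document was seen
    # containing every term"); natural log, as noted in the thread.
    # ASSUMPTION: the trailing "+ 1" mirrors TfidfTransformer's behaviour.
    return math.log((1 + n_docs) / (1 + df)) + 1


n_docs = 3
for df in (1, 2, 3):
    print(df, idf_naive(n_docs, df), idf_smooth(n_docs, df))
```

With 3 documents, a term that appears in all of them gets a negative value from the first function (the log of 3/4), while the second one never drops below 1.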


> On May 23, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> 
> Oh, sorry, never mind my last mail.
> 
> 
>> On May 22, 2015, at 5:15 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> 
>> Thanks, Lars, that's what I thought (natural log). I will try some more 
>> combinations later and browse through the source code to see if I can 
>> somehow manage to reproduce the results. Maybe it would be good to write it 
>> up as an example then for the documentation -- in case someone else is 
>> wondering about it since it is slightly different from the "classic" tf-idf 
>> approach.
>> 
>> Btw., is there anything that speaks against those negative values in the 
>> feature vectors? I mean, for SGD classifiers, for example, it may be 
>> beneficial to have values that can be both positive and negative.
>> 
>> Best,
>> Sebastian
>> 
>> 
>>> On May 22, 2015, at 12:00 PM, Lars Buitinck <larsm...@gmail.com> wrote:
>>> 
>>> 2015-05-22 8:29 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>>> The default equation is:
>>>> # idf = log ( number_of_docs / number_of_docs_where_term_appears )
>>>> 
>>>> And in the online documentation at
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
>>>> I found the additional info:
>>>>> smooth_idf : boolean, default=True
>>>>> Smooth idf weights by adding one to document frequencies, as if an extra 
>>>>> document was seen containing every term in the collection exactly once. 
>>>>> Prevents zero divisions.
>>>> 
>>>> 
>>>> So I assume that smooth_idf is calculated as follows:
>>>> # smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
>>> 
>>> I don't have a full answer ready, but note that number_of_docs must
>>> also be incremented by the smoothing term (which is actually a
>>> misnomer, IIRC). Otherwise the logs can come out negative.
>>> 
>>> Logs are also always natural logs in scikit-learn.
>>> 
>>> HTH
>>> 
>>> ------------------------------------------------------------------------------
>>> One dashboard for servers and applications across Physical-Virtual-Cloud 
>>> Widest out-of-the-box monitoring support with 50+ applications
>>> Performance metrics, stats and reports that give you Actionable Insights
>>> Deep dive visibility with transaction tracing using APM Insight.
>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>> 
>> 
> 

