Okay, I figured it out... there is a typo in the documentation:
Instead of tfidf = tf * idf + 1 as described in the documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
it is actually:
tfidf = tf * (idf + 1.0), where idf = np.log(n_docs / df)
Example:
import numpy as np
docs = np.array([
'The sun is shining',
'The weather is sweet',
'The sun is shining and the weather is sweet'])
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
cv = CountVectorizer()
tf = cv.fit_transform(docs).toarray()
tf
array([[0, 1, 1, 1, 0, 1, 0],
[0, 1, 0, 0, 1, 1, 1],
[1, 2, 1, 1, 1, 2, 1]], dtype=int64)
cv.vocabulary_
{'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}
tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
tfidf.fit_transform(tf).toarray()
array([[ 0. , 1. , 1.40546511, 1.40546511, 0. ,
1. , 0. ],
[ 0. , 1. , 0. , 0. , 1.40546511,
1. , 1.40546511],
[ 2.09861229, 2. , 1.40546511, 1.40546511, 1.40546511,
2. , 1.40546511]])
Now, calculating the tf-idf of the first 3 words in the vocabulary, "and",
"is", and "shining", in document 3:
tf_and = 1
df_and = 1
tf_and * (np.log(len(docs) / df_and) + 1.0)
2.09861228866811
tf_is = 2
df_is = 3
tf_is * (np.log(len(docs) / df_is) + 1.0)
2.0
tf_shining = 1
df_shining = 2
tf_shining * (np.log(len(docs) / df_shining) + 1.0)
1.4054651081081644
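
As a further sanity check, here is a minimal sketch in plain Python (no scikit-learn needed) that recomputes the whole matrix from the raw counts above, assuming the formula tf * (ln(n_docs / df) + 1.0) with no normalization. The names df, idf_plus_one, and tfidf_manual are my own and are not part of the scikit-learn API.

```python
import math

# Raw term counts copied from the CountVectorizer output above
# (columns: and, is, shining, sun, sweet, the, weather).
tf = [
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
]

n_docs = len(tf)

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in tf if row[j] > 0) for j in range(len(tf[0]))]

# Unsmoothed idf (natural log) plus the constant 1.0.
idf_plus_one = [math.log(n_docs / d) + 1.0 for d in df]

# tf-idf = tf * (idf + 1.0), computed row by row.
tfidf_manual = [[t * w for t, w in zip(row, idf_plus_one)] for row in tf]

for row in tfidf_manual:
    print([round(v, 8) for v in row])
```

If the assumption holds, the printed rows should agree with the TfidfTransformer(smooth_idf=False, norm=None) output shown earlier, up to rounding.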
> On May 22, 2015, at 5:15 PM, Sebastian Raschka <[email protected]> wrote:
>
> Thanks, Lars, that's what I thought (natural log). I will try some more
> combinations later and browse through the source code to see if I can somehow
> manage to reproduce the results. Maybe it would be good to write it up as an
> example then for the documentation -- in case someone else is wondering about
> it since it is slightly different from the "classic" tf-idf approach.
>
> Btw., is there anything that speaks against those negative values in the
> feature vectors? I mean, for SGD classifiers, for example, it may be
> beneficial to have values that can be positive and negative.
>
> Best,
> Sebastian
>
>
>> On May 22, 2015, at 12:00 PM, Lars Buitinck <[email protected]> wrote:
>>
>> 2015-05-22 8:29 GMT+02:00 Sebastian Raschka <[email protected]>:
>>> The default equation is:
>>> # idf = log ( number_of_docs / number_of_docs_where_term_appears )
>>>
>>> And in the online documentation at
>>> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
>>> I found the additional info:
>>>> smooth_idf : boolean, default=True
>>>> Smooth idf weights by adding one to document frequencies, as if an extra
>>>> document was seen containing every term in the collection exactly once.
>>>> Prevents zero divisions.
>>>
>>>
>>> So I assume that smooth_idf is calculated as follows:
>>> # smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
>>
>> I don't have a full answer ready, but note that number_of_docs must
>> also be incremented by the smoothing term (which is actually a
>> misnomer, IIRC). Otherwise the logs can come out negative.
>>
>> Logs are also always natural logs in scikit-learn.
>>
>> HTH
>>
>> ------------------------------------------------------------------------------
>> One dashboard for servers and applications across Physical-Virtual-Cloud
>> Widest out-of-the-box monitoring support with 50+ applications
>> Performance metrics, stats and reports that give you Actionable Insights
>> Deep dive visibility with transaction tracing using APM Insight.
>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>> _______________________________________________
>> Scikit-learn-general mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>
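
To illustrate Lars's point in the quoted reply about also incrementing number_of_docs, here is a minimal sketch in plain Python comparing the two candidate formulas, log(n / (1 + df)) and log((1 + n) / (1 + df)). It is not taken from the scikit-learn source; n_docs = 3 simply matches the three example documents, and the variable names are mine.

```python
import math

n_docs = 3  # assumed collection size, matching the three example documents

# Variant A: only the document frequency is incremented.
idf_df_only = {df: math.log(n_docs / (1 + df)) for df in (1, 2, 3)}

# Variant B: both the document count and the document frequency are incremented.
idf_both = {df: math.log((1 + n_docs) / (1 + df)) for df in (1, 2, 3)}

for df in (1, 2, 3):
    print(df, round(idf_df_only[df], 4), round(idf_both[df], 4))
```

For a term that occurs in every document (df = 3), variant A yields a negative value, while variant B bottoms out at zero.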