Hi, 
I am wondering how exactly the tf-idfs are calculated by the TfidfTransformer 
or the TfidfVectorizer, since I can't reproduce the results. It would be 
great if someone could help me out here. For example, let's consider these two 
simple documents, which can be transformed into a bag-of-words representation 
via the CountVectorizer:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = ['the sun is shining', 'the weather is nice']
>>> bag = count.fit_transform(docs)
>>> bag.toarray()
array([[1, 0, 1, 1, 1, 0],
       [1, 1, 0, 0, 1, 1]], dtype=int64)
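
For reference, the column order corresponds to the fitted vocabulary, which, 
if I'm not mistaken, is sorted alphabetically (dict printed here in sorted 
order for readability):

>>> count.vocabulary_
{'is': 0, 'nice': 1, 'shining': 2, 'sun': 3, 'the': 4, 'weather': 5}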

The following results are then produced by the TfidfTransformer using the 
default settings:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
>>> tfidf.fit_transform(bag).toarray()
array([[ 0.40993715,  0.        ,  0.57615236,  0.57615236,  0.40993715,
         0.        ],
       [ 0.40993715,  0.57615236,  0.        ,  0.        ,  0.40993715,
         0.57615236]])

Now, let me "manually" L2-normalize the term frequencies of the first document:

>>> import numpy as np
>>> tfs = [1, 0, 1, 1, 1, 0]
>>> l2 = np.sqrt(np.sum([i**2 for i in tfs]))
>>> tfs_l2 = [i / l2 for i in tfs]
>>> tfs_l2
[0.5, 0.0, 0.5, 0.5, 0.5, 0.0]

This gives me the same results as the TfidfTransformer with use_idf disabled, 
so far so good...

>>> TfidfTransformer(use_idf=False, norm='l2',
...                   smooth_idf=False).fit_transform(bag).toarray()
array([[ 0.5,  0. ,  0.5,  0.5,  0.5,  0. ],
       [ 0.5,  0.5,  0. ,  0. ,  0.5,  0.5]])

Next, the idfs:

As far as I understand, the default equation is:
# idf = log ( number_of_docs / number_of_docs_where_term_appears )
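
If I take this equation literally for the toy example above (reading the 
document frequencies off the bag-of-words matrix), I get:

>>> df = np.array([2, 1, 1, 1, 2, 1])  # number of docs containing each term
>>> np.log(2.0 / df)  # natural log; the log base is just my guess
array([ 0.        ,  0.69314718,  0.69314718,  0.69314718,  0.        ,
        0.69314718])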

In the online documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
I found this additional info:
> smooth_idf : boolean, default=True
> Smooth idf weights by adding one to document frequencies, as if an extra 
> document was seen containing every term in the collection exactly once. 
> Prevents zero divisions.


So I assume that the smooth_idf is calculated as follows:
# smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
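
Or, in code for the toy example (using log10 here, though I also tried the 
natural log):

>>> smooth_idf = np.log10(2.0 / (1 + df))
>>> smooth_idf
array([-0.17609126,  0.        ,  0.        ,  0.        , -0.17609126,
        0.        ])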

Also, on the same page, I read:
> The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead 
> of tf * idf.

However, when I calculate the tf-idf based on this smooth_idf, I get slightly 
different results, as shown below:

For the 1st term frequency in the 1st vector:
>>> smooth_idf_0 = np.log10(2.0 / (1+2))
>>> 0.5 * (smooth_idf_0+1)
0.41195
#  0.4099 via the tfidf transformer

For the 2nd term frequency in the 1st vector:
>>> smooth_idf_1 = np.log10(2.0 / (1+1))
>>> 0.0 * (smooth_idf_1+1)
0.0
#  0.0 via the tfidf transformer

For the 3rd term frequency in the 1st vector:
>>> smooth_idf_2 = np.log10(2.0 / (1+1))
>>> 0.5 * (smooth_idf_2+1)
0.5
# 0.5761 via the tfidf transformer
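
Putting my whole manual calculation for the first document into one snippet 
(based on my smooth_idf assumption above and the tf * (idf + 1) formula):

>>> np.array(tfs_l2) * (smooth_idf + 1)
array([ 0.41195437,  0.        ,  0.5       ,  0.5       ,  0.41195437,
        0.        ])

which again doesn't match the [0.40993715, 0., 0.57615236, ...] returned by 
the TfidfTransformer.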

I tried different things (natural log, smooth_idf=False, etc.) but couldn't 
reproduce the values returned by the TfidfTransformer.
I hope someone can help me with this!

PS: Maybe a toy example could be added to the documentation in the future for 
clarity.

Best,
Sebastian

