Hi,
I am wondering how exactly the tf-idfs are calculated in the TfidfTransformer
or TfidfVectorizer since I can't really reproduce the results. It would be
great if someone could help me a little bit here. For example, let's consider these two
simple documents that can be transformed into a bag-of-words representation via
the CountVectorizer:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = ['the sun is shining', 'the weather is nice',]
>>> bag = count.fit_transform(docs)
>>> bag.toarray()
array([[1, 0, 1, 1, 1, 0],
       [1, 1, 0, 0, 1, 1]], dtype=int64)
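(For reference, the columns above follow the CountVectorizer's alphabetically sorted vocabulary; a quick self-contained check, assuming a standard scikit-learn install:)

```python
# Check which term each column corresponds to: vocabulary_ maps term -> column index.
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = ['the sun is shining', 'the weather is nice']
bag = count.fit_transform(docs)

# Sorted by column index, i.e., the column order of bag.toarray()
print(sorted(count.vocabulary_.items(), key=lambda kv: kv[1]))
# columns: is, nice, shining, sun, the, weather
```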
The following results are then produced by the TfidfTransformer using the
default settings:
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
>>> tfidf.fit_transform(bag).toarray()
array([[ 0.40993715,  0.        ,  0.57615236,  0.57615236,  0.40993715,  0.        ],
       [ 0.40993715,  0.57615236,  0.        ,  0.        ,  0.40993715,  0.57615236]])
Now, let me "manually" L2-normalize the term frequencies of the first document:
>>> import numpy as np
>>> tfs = [1, 0, 1, 1, 1, 0]
>>> l2 = np.sqrt(np.sum([i**2 for i in tfs]))
>>> tfs_l2 = [i / l2 for i in tfs]
>>> tfs_l2
[0.5, 0.0, 0.5, 0.5, 0.5, 0.0]
I get the same results as the TfidfTransformer with use_idf disabled, so
far so good:
>>> TfidfTransformer(use_idf=False, norm='l2',
...                  smooth_idf=False).fit_transform(bag).toarray()
array([[ 0.5, 0. , 0.5, 0.5, 0.5, 0. ],
[ 0.5, 0.5, 0. , 0. , 0.5, 0.5]])
Next, the idfs. The default equation is:
# idf = log ( number_of_docs / number_of_docs_where_term_appears )
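Plugged into the toy corpus above (2 documents; "the" and "is" appear in both, every other term in exactly one), that equation would give the following — with the base of the log left open, since I haven't found it stated (a sketch, not taken from the implementation):

```python
import numpy as np

n_docs = 2
# document frequencies in the toy corpus (columns: is, nice, shining, sun, the, weather)
df = np.array([2, 1, 1, 1, 2, 1])

# idf = log(n_docs / df), tried with both log bases since I don't know
# which one is used
idf_ln = np.log(n_docs / df)
idf_log10 = np.log10(n_docs / df)
print(idf_ln)     # 0 for df=2, ln(2) ~ 0.693 for df=1
print(idf_log10)  # 0 for df=2, log10(2) ~ 0.301 for df=1
```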
And in the online documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
I found the additional info:
> smooth_idf : boolean, default=True
> Smooth idf weights by adding one to document frequencies, as if an extra
> document was seen containing every term in the collection exactly once.
> Prevents zero divisions.
So I assume that the smoothed idf is calculated as follows:
# smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
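Under that assumption, the smoothed idfs for the toy corpus would be (again a sketch of my reading of the docs, with the log base still an open question):

```python
import numpy as np

n_docs = 2
df = np.array([2, 1, 1, 1, 2, 1])  # columns: is, nice, shining, sun, the, weather

# my assumed smoothing: add one to the document frequencies only
smooth_idf = np.log10(n_docs / (1 + df))
print(smooth_idf)  # log10(2/3) ~ -0.176 for df=2, log10(2/2) = 0 for df=1
```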
Also, on the same page, I read:
> The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead
> of tf * idf.
However, when I calculate the tf-idfs based on that smoothed idf, I get slightly
different results, as shown below:
For the 1st term frequency in the 1st vector:
>>> smooth_idf_0 = np.log10(2 / (1+2))
>>> 0.5 * (smooth_idf_0+1)
0.41195
# 0.4099 via the tfidf transformer
For the 2nd term frequency in the 1st vector:
>>> smooth_idf_1 = np.log10(2 / (1+1))
>>> 0.0 * (smooth_idf_1+1)
0.0
# 0.0 via the tfidf transformer
For the 3rd term frequency in the 1st vector:
>>> smooth_idf_2 = np.log10(2 / (1+1))
>>> 0.5 * (smooth_idf_2+1)
0.5
# 0.5761 via the tfidf transformer
I tried different things (natural log, smooth_idf=False, etc.) but couldn't
reproduce the results returned by the TfidfTransformer.
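To make those attempts concrete, here is the small comparison harness I have been using — it L2-normalizes the raw counts first, applies tf * (idf + 1) with my assumed idf variants, and reports the mismatch against the transformer's output, so every assumption (log base, smoothing in the denominator, normalization order) is explicit in the code:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the sun is shining', 'the weather is nice']
bag = CountVectorizer().fit_transform(docs).toarray().astype(float)
reference = TfidfTransformer().fit_transform(bag).toarray()  # default settings

n_docs = bag.shape[0]
df = (bag > 0).sum(axis=0)  # document frequency per term

def manual_tfidf(log_fn, smooth):
    # my assumed formulas: optionally add 1 to df, then tf * (idf + 1)
    idf = log_fn(n_docs / (df + (1 if smooth else 0)))
    tf = bag / np.linalg.norm(bag, axis=1, keepdims=True)  # l2-normalized counts
    return tf * (idf + 1)

for log_fn, smooth in [(np.log10, True), (np.log10, False),
                       (np.log, True), (np.log, False)]:
    diff = np.abs(manual_tfidf(log_fn, smooth) - reference).max()
    print(log_fn.__name__, smooth, 'max abs diff:', round(diff, 4))
```

None of the four variants brings the difference down to zero, which is exactly my problem.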
I hope someone can help me with this!
PS: Maybe a toy example could be added to the documentation in the future for clarity.
Best,
Sebastian
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general