Hi,

I am wondering how exactly the tf-idf values are calculated by the TfidfTransformer or TfidfVectorizer, since I can't reproduce the results. It would be great if someone could help me out here. E.g., let's consider these two simple documents, which can be transformed into a bag-of-words representation via the CountVectorizer:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> count = CountVectorizer()
>>> docs = ['the sun is shining', 'the weather is nice']
>>> bag = count.fit_transform(docs)
>>> bag.toarray()
array([[1, 0, 1, 1, 1, 0],
       [1, 1, 0, 0, 1, 1]], dtype=int64)

The TfidfTransformer with its default settings then produces the following results:

>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> tfidf = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
>>> tfidf.fit_transform(bag).toarray()
array([[ 0.40993715,  0.        ,  0.57615236,  0.57615236,  0.40993715,  0.        ],
       [ 0.40993715,  0.57615236,  0.        ,  0.        ,  0.40993715,  0.57615236]])

Now, let me "manually" l2-normalize the term frequencies of the first document:

>>> import numpy as np
>>> tfs = [1, 0, 1, 1, 1, 0]
>>> l2 = np.sqrt(np.sum([i**2 for i in tfs]))
>>> tfs_l2 = [i / l2 for i in tfs]
>>> tfs_l2
[0.5, 0.0, 0.5, 0.5, 0.5, 0.0]

This matches the results from the TfidfTransformer with use_idf disabled, so far so good:

>>> TfidfTransformer(use_idf=False, norm='l2', smooth_idf=False).fit_transform(bag).toarray()
array([[ 0.5,  0. ,  0.5,  0.5,  0.5,  0. ],
       [ 0.5,  0.5,  0. ,  0. ,  0.5,  0.5]])

Next, the idfs. The default equation is:

# idf = log( number_of_docs / number_of_docs_where_term_appears )

And in the online documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
I found this additional info:

> smooth_idf : boolean, default=True
> Smooth idf weights by adding one to document frequencies, as if an extra
> document was seen containing every term in the collection exactly once.
> Prevents zero divisions.

So I assume that the smoothed idf is calculated as follows:

# smooth_idf = log( number_of_docs / (1 + number_of_docs_where_term_appears) )

Also, on the same page, I read:

> The actual formula used for tf-idf is tf * (idf + 1) = tf + tf * idf, instead
> of tf * idf.

However, when I calculate the tf-idfs based on this smoothed idf, I get slightly different results, as shown below.

For the 1st term frequency in the 1st vector:

>>> smooth_idf_0 = np.log10(2 / (1 + 2))
>>> 0.5 * (smooth_idf_0 + 1)
0.41195  # 0.4099 via the TfidfTransformer

For the 2nd term frequency in the 1st vector:

>>> smooth_idf_1 = np.log10(2 / (1 + 1))
>>> 0.0 * (smooth_idf_1 + 1)
0.0  # 0.0 via the TfidfTransformer

For the 3rd term frequency in the 1st vector:

>>> smooth_idf_2 = np.log10(2 / (1 + 1))
>>> 0.5 * (smooth_idf_2 + 1)
0.5  # 0.5761 via the TfidfTransformer

I tried different things (natural log, smooth_idf=False, etc.), but I couldn't reproduce the values returned by the TfidfTransformer. I hope someone can help me with this!

PS: Maybe a toy example could be added to the documentation in the future for clarity.

Best,
Sebastian
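
PPS: In case it helps, here is my whole attempted reproduction as one self-contained script. To be clear, the smoothed-idf formula and the "normalize first, then weight" order in it are just my assumptions from reading the docs, so they may well be exactly where I go wrong:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the sun is shining', 'the weather is nice']
bag = CountVectorizer().fit_transform(docs).toarray()

n_docs = bag.shape[0]
df = (bag > 0).sum(axis=0)  # number_of_docs_where_term_appears

# My assumed smoothed idf: log( n_docs / (1 + df) ).
# Base-10 log here; np.log gives the same kind of mismatch.
idf = np.log10(n_docs / (1 + df))

# I also assume the l2 normalization is applied to the raw term
# frequencies *before* the tf * (idf + 1) weighting.
tf_l2 = bag / np.sqrt((bag ** 2).sum(axis=1, keepdims=True))
my_tfidf = tf_l2 * (idf + 1)

print(my_tfidf[0])
# -> approx [ 0.41195  0.  0.5  0.5  0.41195  0. ]

print(TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
      .fit_transform(bag).toarray()[0])
# -> approx [ 0.40994  0.  0.57615  0.57615  0.40994  0. ]

The two print calls at the end show my manual result next to the transformer's output for the first document, so the difference is easy to see.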