Okay, I figured it out... With smooth_idf=False and norm=None, the output matches the formula described in the documentation at
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

    tfidf = tf * (idf + 1.0), where idf = np.log(n_docs / df)

Note that this is not the same as tfidf = tf * idf + 1.0: the two agree only when tf = 1, and for a term like "is" in document 3 (tf = 2, idf = 0) the transformer returns 2.0, not 1.0.

Example:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining and the weather is sweet'])

    cv = CountVectorizer()
    tf = cv.fit_transform(docs).toarray()
    tf

    array([[0, 1, 1, 1, 0, 1, 0],
           [0, 1, 0, 0, 1, 1, 1],
           [1, 2, 1, 1, 1, 2, 1]], dtype=int64)

    cv.vocabulary_

    {'and': 0, 'is': 1, 'shining': 2, 'sun': 3, 'sweet': 4, 'the': 5, 'weather': 6}

    tfidf = TfidfTransformer(use_idf=True, smooth_idf=False, norm=None)
    tfidf.fit_transform(tf).toarray()

    array([[ 0.        ,  1.        ,  1.40546511,  1.40546511,  0.        ,  1.        ,  0.        ],
           [ 0.        ,  1.        ,  0.        ,  0.        ,  1.40546511,  1.        ,  1.40546511],
           [ 2.09861229,  2.        ,  1.40546511,  1.40546511,  1.40546511,  2.        ,  1.40546511]])

Now, calculating the tf-idf of the first 3 words in the vocabulary ("and", "is", and "shining") in document 3:

    # "and": occurs once in document 3 and in 1 of the 3 documents
    tf_and = 1
    df_and = 1
    tf_and * (np.log(len(docs) / df_and) + 1.0)

    2.09861228866811

    # "is": occurs twice in document 3 and in all 3 documents
    tf_is = 2
    df_is = 3
    tf_is * (np.log(len(docs) / df_is) + 1.0)

    2.0

    # "shining": occurs once in document 3 and in 2 of the 3 documents
    tf_shining = 1
    df_shining = 2
    tf_shining * (np.log(len(docs) / df_shining) + 1.0)

    1.4054651081081644

> On May 22, 2015, at 5:15 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> Thanks, Lars, that's what I thought (natural log). I will try some more
> combinations later and browse through the source code to see if I can somehow
> manage to reproduce the results. Maybe it would be good to write it up as an
> example then for the documentation -- in case someone else is wondering about
> it since it is slightly different from the "classic" tf-idf approach.
>
> Btw. is there anything that speaks against those negative values in the
> feature vectors? I mean for e.g., SGD classifiers it can maybe be beneficial
> to have values that can be positive and negative.
>
> Best,
> Sebastian
>
>
>> On May 22, 2015, at 12:00 PM, Lars Buitinck <larsm...@gmail.com> wrote:
>>
>> 2015-05-22 8:29 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>> The default equation is:
>>> # idf = log ( number_of_docs / number_of_docs_where_term_appears )
>>>
>>> And in the online documentation at
>>> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
>>> I found the additional info:
>>>> smooth_idf : boolean, default=True
>>>> Smooth idf weights by adding one to document frequencies, as if an extra
>>>> document was seen containing every term in the collection exactly once.
>>>> Prevents zero divisions.
>>>
>>>
>>> So that I assume that the smooth_idf is calculated as follows:
>>> # smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
>>
>> I don't have a full answer ready, but note that number_of_docs must
>> also be incremented by the smoothing term (which is actually a
>> misnomer, IIRC). Otherwise the logs can come out negative.
>>
>> Logs are also always natural logs in scikit-learn.
>>
>> HTH
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
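
For anyone who wants to check the numbers without scikit-learn, here is a minimal sketch in plain Python that applies the formula used in the worked calculations above, tf * (np.log(n_docs / df) + 1.0), to the whole count matrix. The helper name tfidf_matrix and its smooth flag are made up for illustration. The smooth=True branch follows Lars's note that the document count has to be incremented together with the document frequencies; that the same + 1.0 is applied in the smoothed case is an assumption here, not something established in this thread.

```python
import math

# Term counts copied from the CountVectorizer output in the message
# (columns: and, is, shining, sun, sweet, the, weather).
tf = [
    [0, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 1],
    [1, 2, 1, 1, 1, 2, 1],
]


def tfidf_matrix(tf, smooth=False):
    """Hypothetical helper: tf * (idf + 1.0) with a natural-log idf."""
    n_docs = len(tf)
    n_terms = len(tf[0])
    # Document frequency: number of documents in which each term occurs.
    df = [sum(1 for row in tf if row[j] > 0) for j in range(n_terms)]
    # Smoothing (assumed form) adds one to the document count and to each df.
    s = 1 if smooth else 0
    idf = [math.log((n_docs + s) / (d + s)) for d in df]
    return [[row[j] * (idf[j] + 1.0) for j in range(n_terms)] for row in tf]


unsmoothed = tfidf_matrix(tf)
smoothed = tfidf_matrix(tf, smooth=True)

# Document 3, rounded to the precision shown in the array output above.
print([round(v, 8) for v in unsmoothed[2]])
print([round(v, 8) for v in smoothed[2]])
```

The first printed row should agree with the third row of the TfidfTransformer output quoted in the message. With the numerator incremented as well, no idf in the smoothed variant can be negative, since df never exceeds the number of documents.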