Okay, I got it now and put together a short notebook for personal reference, in case anyone is interested:
https://github.com/rasbt/pattern_classification/blob/master/machine_learning/scikit-learn/tfidf_scikit-learn.ipynb

> On May 23, 2015, at 4:39 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
> Oh, sorry, never-mind my last mail.
>
>
>> On May 22, 2015, at 5:15 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>>
>> Thanks, Lars, that's what I thought (natural log). I will try some more
>> combinations later and browse through the source code to see if I can
>> somehow manage to reproduce the results. Maybe it would be good to write it
>> up as an example then for the documentation -- in case someone else is
>> wondering about it since it is slightly different from the "classic" tf-idf
>> approach.
>>
>> Btw. is there anything that speaks against those negative values in the
>> feature vectors? I mean for e.g., SGD classifiers it can maybe be beneficial
>> to have values that can be positive and negative.
>>
>> Best,
>> Sebastian
>>
>>
>>> On May 22, 2015, at 12:00 PM, Lars Buitinck <larsm...@gmail.com> wrote:
>>>
>>> 2015-05-22 8:29 GMT+02:00 Sebastian Raschka <se.rasc...@gmail.com>:
>>>> The default equation is:
>>>> # idf = log ( number_of_docs / number_of_docs_where_term_appears )
>>>>
>>>> And in the online documentation at
>>>> http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
>>>> I found the additional info:
>>>>> smooth_idf : boolean, default=True
>>>>>     Smooth idf weights by adding one to document frequencies, as if an extra
>>>>>     document was seen containing every term in the collection exactly once.
>>>>>     Prevents zero divisions.
>>>>
>>>>
>>>> So that I assume that the smooth_idf is calculated as follows:
>>>> # smooth_idf = log ( number_of_docs / (1 + number_of_docs_where_term_appears) )
>>>
>>> I don't have a full answer ready, but note that number_of_docs must
>>> also be incremented by the smoothing term (which is actually a
>>> misnomer, IIRC). Otherwise the logs can come out negative.
>>>
>>> Logs are also always natural logs in scikit-learn.
>>>
>>> HTH
>>>
>>> ------------------------------------------------------------------------------
>>> One dashboard for servers and applications across Physical-Virtual-Cloud
>>> Widest out-of-the-box monitoring support with 50+ applications
>>> Performance metrics, stats and reports that give you Actionable Insights
>>> Deep dive visibility with transaction tracing using APM Insight.
>>> http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
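
To make Lars's point about the smoothing term concrete, here is a minimal sketch in plain Python. It compares the guess from the thread (only the document frequency is incremented) with the variant where number_of_docs is incremented as well. The function names and the toy counts are made up for illustration and are not part of scikit-learn. As far as I can tell from the documentation, TfidfTransformer additionally adds 1 to the resulting idf; that step is left out here to keep the comparison focused on the smoothing.

```python
import math


def naive_smooth_idf(n_docs, doc_freq):
    """Guess from the thread: only the document frequency is incremented."""
    return math.log(n_docs / (1 + doc_freq))


def smooth_idf(n_docs, doc_freq):
    """Lars's correction: increment both counts, as if one extra document
    containing every term exactly once had been seen."""
    return math.log((1 + n_docs) / (1 + doc_freq))


# Toy corpus of 3 documents; doc_freq is the number of documents
# in which a given term appears (hypothetical values).
n_docs = 3
for doc_freq in (1, 2, 3):
    print(doc_freq, naive_smooth_idf(n_docs, doc_freq), smooth_idf(n_docs, doc_freq))
```

With only the denominator incremented, a term that appears in every document gets log(3/4), which is negative. With both counts incremented the ratio is at least 1 whenever doc_freq <= n_docs, so the natural log never drops below zero.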