2013/6/2 Harold Nguyen <har...@nexgate.com>:
> http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
> Does TfidfVectorizer take a sequence of filenames, where each file is just a
> plain text file?

That depends on the parameter input (the first in the list). In the example, I set it to 'filename', so the vectorizer expects a sequence of filenames and reads each file itself.

> Also, according to the link, I thought min_df only lives in the interval
> [0.0, 1.0] or did I misunderstand? If it's an "int" does that recommend the
> number of occurrences rather than the frequency?

This is described in the docs (in somewhat poor grammar) as "If float, the parameter represents a proportion of documents, integer absolute counts." So an integer is an absolute number of documents, not a number of occurrences: if min_df=2, at least two documents must contain a term for it to become a feature in X.

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
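
As a postscript to the reply above, here is a minimal sketch of both points: input='filename' and the int versus float meaning of min_df. The three file names and their contents are a made-up toy corpus written to a temporary directory, not anything from the original thread, and the float threshold 0.75 is chosen only to make the difference visible.

```python
import os
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: one plain-text document per file.
docs = {
    "a.txt": "apple banana cherry",
    "b.txt": "apple banana",
    "c.txt": "apple durian",
}

tmpdir = tempfile.mkdtemp()
filenames = []
for name in sorted(docs):
    path = os.path.join(tmpdir, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(docs[name])
    filenames.append(path)

# input='filename': fit/transform receive paths and read the files themselves.
# min_df=2 (int): keep only terms that appear in at least 2 documents.
vec_int = TfidfVectorizer(input="filename", min_df=2)
X_int = vec_int.fit_transform(filenames)
vocab_int = sorted(vec_int.vocabulary_)

# min_df=0.75 (float): keep only terms that appear in at least 75% of the
# documents, i.e. in all 3 of the 3 files here.
vec_float = TfidfVectorizer(input="filename", min_df=0.75)
vec_float.fit(filenames)
vocab_float = sorted(vec_float.vocabulary_)

print(vocab_int)    # terms found in >= 2 documents
print(vocab_float)  # terms found in >= 75% of documents
print(X_int.shape)  # (number of documents, number of kept terms)
```

With this corpus, "cherry" and "durian" each occur in a single file, so min_df=2 drops them; "banana" occurs in two of three files, so it survives the integer threshold but not the 0.75 proportion.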