On 06/02/2013 08:48 PM, Harold Nguyen wrote:
Hi Lars,
Thank you very much for this response. Please excuse my questions
since I'm new.
From the documentation on TfidfVectorizer here:
http://scikit-learn.org/dev/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
Does TfidfVectorizer take a sequence of filenames, where each file is
just a plain text file?
Also, according to the link, I thought min_df only lives in the
interval [0.0, 1.0], or did I misunderstand? If it's an "int", does
that mean the number of occurrences rather than the frequency?
Yes, it does.
I named the variable, I think, and it is a bad name :-(
Should we rename it?
I think giving a count makes more sense than giving a frequency: you
want to exclude outliers that appear only once or twice for example.
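
To make the int-versus-float distinction concrete, here is a minimal
sketch in plain Python. It is not scikit-learn's code: the helper name
filter_by_min_df and the whitespace tokenization are assumptions made
for illustration, and it assumes the count is the number of documents
that contain a term.

```python
from collections import Counter


def filter_by_min_df(docs, min_df):
    """Return the sorted vocabulary kept under a min_df threshold.

    Hypothetical helper, not part of scikit-learn. An int min_df is
    read as an absolute number of documents; a float min_df is read
    as a proportion of the documents.
    """
    # Document frequency: how many documents contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))

    # Turn a proportion into an absolute document count.
    if isinstance(min_df, float):
        threshold = min_df * len(docs)
    else:
        threshold = min_df

    return sorted(term for term, count in df.items() if count >= threshold)


docs = ["the cat sat", "the dog sat", "the bird flew"]
vocab_count = filter_by_min_df(docs, 2)    # terms in at least 2 documents
vocab_ratio = filter_by_min_df(docs, 0.5)  # terms in at least half of them
```

With the real vectorizer the corresponding calls would be
TfidfVectorizer(min_df=2) and TfidfVectorizer(min_df=0.5).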
Cheers,
Andy
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general