I was wondering what the rationale is for making the default token pattern
for the CountVectorizer require *2* or more alphanumeric characters to form
a token. This was not intuitive default behavior for me, so I ended up with
a bug where some strings in my vocabulary like "Hepatitis A" were not
counted. I can see how it could be beneficial for removing stop words like
'a' and 'I'; however, I would argue for doing that through the stop_words
parameter instead.
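
To illustrate, here is a minimal sketch that uses only Python's re module. It
assumes the documented default token_pattern, r"(?u)\b\w\w+\b", and that the
vectorizer lowercases the text before matching. The tokenize helper and the
alternative pattern are my own illustrative names, not part of scikit-learn.

```python
import re

# Default token_pattern documented for CountVectorizer:
# a token needs two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative that also accepts single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern):
    """Lowercase the text and return every regex match (illustrative helper)."""
    return re.findall(pattern, text.lower())


default_tokens = tokenize("Hepatitis A", DEFAULT_PATTERN)
single_char_tokens = tokenize("Hepatitis A", SINGLE_CHAR_PATTERN)

print(default_tokens)      # the "a" is dropped
print(single_char_tokens)  # the "a" is kept
```

Passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer should give the second
behavior, with stop_words then handling words like 'a' and 'I' explicitly.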
Regards,
-Nathan Breit
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general