[Scikit-learn-general] CountVectorizer token pattern

Nathan Breit Wed, 17 Sep 2014 00:26:22 -0700

I was wondering what the rationale is for having the default token pattern
of CountVectorizer require *2* or more alphanumeric characters to form a
token. This default was not intuitive to me, and I ended up with a bug
where some strings in my vocabulary, such as "Hepatitis A", were not counted
as expected because the single-character token "A" was dropped. I can see
how it could be useful for removing stop words like 'a' and 'I', but I
would argue for doing that through the stop_words parameter instead.
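
To illustrate the behavior, here is a minimal sketch. It assumes the default
token_pattern is r"(?u)\b\w\w+\b", as given in the scikit-learn
documentation, and that text is lowercased before tokenizing. It uses only
Python's re module rather than CountVectorizer itself, and the tokenize
helper and constant names are hypothetical:

```python
import re

# Assumed default token_pattern of CountVectorizer: 2+ word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed alternative that also keeps single-character tokens.
SINGLE_CHAR_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, return all regex matches."""
    return re.findall(pattern, text.lower())


default_tokens = tokenize("Hepatitis A", DEFAULT_PATTERN)
relaxed_tokens = tokenize("Hepatitis A", SINGLE_CHAR_PATTERN)

print(default_tokens)  # ['hepatitis']  -- the "a" is dropped
print(relaxed_tokens)  # ['hepatitis', 'a']
```

Passing the relaxed pattern as the token_pattern argument of CountVectorizer
should keep tokens like "a", with unwanted words then filtered through
stop_words.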
Regards,
-Nathan Breit


