First, thanks for all your great work on scikits.learn! It’s making my life easier.
Second, I found surprising behavior in sklearn.feature_extraction.text. I'm using TfidfVectorizer and CountVectorizer to process news stories. The default tokenizer uses the regular expression '(?u)\b\w\w+\b', which produces this tokenization:

    "CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215"
    => ['CITIGROUP', 'CUTS', 'APPLE', 'PRICE', 'TARGET', 'TO', '212', 'FROM', '215']

I'd argue that '$212' should either be tokenized as '$212' or not tokenized at all. I can fix the regexp myself (a rough sketch is below my signature), but this default behavior seems a little off. It also produces weird tokenizations like:

    "for $2.50" => ["for", "50"]

which is wrong under any interpretation.

Also, a minor documentation bug: the documentation gives the default as

    token_pattern=u'(?u)\b\w\w+\b'

The quoted regexp needs an r prefix; without it, the \b escapes are interpreted as backspace characters in the string literal rather than word-boundary assertions.

Regards,
-Tom
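P.S. Here is a rough sketch of the workaround I had in mind. The pattern and variable names are just my own illustration (not anything shipped with scikit-learn); the only library features it relies on are the token_pattern parameter and build_analyzer(), which exist in the released API.

    # Sketch: override the default token_pattern so $-prefixed amounts
    # like "$212" or "$2.50" survive as single tokens.
    from sklearn.feature_extraction.text import CountVectorizer

    # First alternation branch: a dollar sign, digits, and an optional
    # decimal part. Fallback branch: the stock two-character word pattern.
    money_aware_pattern = r"(?u)\$\d+(?:\.\d+)?|\b\w\w+\b"

    vectorizer = CountVectorizer(token_pattern=money_aware_pattern)

    docs = ["CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215",
            "for $2.50"]

    # build_analyzer() returns the preprocessing + tokenization callable
    # the vectorizer uses internally (note it lowercases by default).
    analyze = vectorizer.build_analyzer()
    for doc in docs:
        print(analyze(doc))

    # Expected output, roughly:
    # ['citigroup', 'cuts', 'apple', 'price', 'target', 'to', '$212', 'from', '$215']
    # ['for', '$2.50']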