First, thanks for all your great work on scikits.learn!  It’s making my life 
easier.

Second, I found surprising behavior in sklearn.feature_extraction.text.  I’m 
using TfidfVectorizer and CountVectorizer to process news stories.  The default 
tokenizer uses the regular expression '(?u)\b\w\w+\b', which produces this 
tokenization:

"CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215”
=>  ['CITIGROUP', 'CUTS', 'APPLE', 'PRICE', 'TARGET', 'TO', '212', 'FROM', 
'215’]

I'd argue that '$212' should either be tokenized as '$212' or dropped entirely.  
I can fix the regexp myself, but this default behavior seems a little off.  It 
also produces odd tokenizations like "for $2.50" => ['for', '50'], which is 
wrong by any interpretation.
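
For anyone else hitting this, here is a minimal sketch of a workaround using a 
custom token_pattern.  The pattern below is only an illustration (it also keeps 
single-character tokens, unlike the default), not a proposal for a new default:

    from sklearn.feature_extraction.text import CountVectorizer

    text = "CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215"

    # Default token_pattern: the "$" is dropped and lone digits disappear.
    default_tok = CountVectorizer().build_tokenizer()
    print(default_tok(text))
    # ['CITIGROUP', 'CUTS', 'APPLE', 'PRICE', 'TARGET', 'TO', '212', 'FROM', '215']
    print(default_tok("for $2.50"))
    # ['for', '50']

    # Illustrative workaround: allow an optional leading "$" and an internal
    # decimal point so currency amounts survive as single tokens.
    money_tok = CountVectorizer(
        token_pattern=r'(?u)\$?\b\w+(?:\.\w+)?\b').build_tokenizer()
    print(money_tok(text))
    # [..., 'TARGET', 'TO', '$212', 'FROM', '$215']
    print(money_tok("for $2.50"))
    # ['for', '$2.50']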

Also, a minor documentation bug: the docs give the default as:
    token_pattern=u'(?u)\b\w\w+\b'

The quoted regexp needs an r prefix: without it, each \b in the string literal 
is interpreted as a backspace character rather than a word-boundary anchor, so 
the documented pattern doesn't match the actual default.

Regards,
-Tom


