The missing 2 when tokenizing 2.50 is indeed a bit weird, though (the pattern requires at least two word characters, so the lone "2" before the dot is dropped).
Tom Fawcett <tom.fawc...@gmail.com> wrote:
>First, thanks for all your great work on scikits.learn! It’s making my
>life easier.
>
>Second, I found surprising behavior in sklearn.feature_extraction.text.
>I’m using TfidfVectorizer and CountVectorizer to process news stories.
>The default tokenizer uses the regular expression '(?u)\b\w\w+\b',
>which produces this tokenization:
>
>"CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215"
>=> ['CITIGROUP', 'CUTS', 'APPLE', 'PRICE', 'TARGET', 'TO', '212',
>'FROM', '215']
>
>I’d argue that '$212' should be tokenized as '$212' or not at all. I
>can fix the regexp myself but this default behavior seems a little off.
>It also produces weird tokenizations like: "for $2.50" => ['for', '50'],
>which is wrong by any interpretation.
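The behavior described above can be reproduced with the standard `re` module alone, which makes the cause easier to see: `\w\w+` needs at least two consecutive word characters, and neither "$" nor "." is a word character. The sketch below also tries one possible replacement pattern. `MONEY_PATTERN` and the `tokenize` helper are made up for illustration; they are not a scikit-learn default, just one assumption about what a currency-aware pattern might look like.

```python
import re

# Default token pattern as quoted above: two or more word characters
# between word boundaries.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption, not a library default):
# first try a number with an optional leading "$" and an optional
# decimal part, otherwise fall back to the default rule.
MONEY_PATTERN = r"(?u)\$?\d+(?:\.\d+)?|\b\w\w+\b"


def tokenize(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return re.findall(pattern, text)


headline = "CITIGROUP CUTS APPLE PRICE TARGET TO $212 FROM $215"

print(tokenize(DEFAULT_PATTERN, headline))     # '$' is dropped from the amounts
print(tokenize(DEFAULT_PATTERN, "for $2.50"))  # ['for', '50']
print(tokenize(MONEY_PATTERN, headline))       # amounts keep their '$'
print(tokenize(MONEY_PATTERN, "for $2.50"))    # ['for', '$2.50']
```

A pattern like this could be passed as the `token_pattern` argument instead of the default. Note that, unlike the default, this sketch also keeps single-digit numbers, which may or may not be what you want.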
>
>Also, a minor documentation bug: the documentation specifies that the
>default is:
> token_pattern=u'(?u)\b\w\w+\b'
>
>The quoted regexp needs an r in front of it (a raw string), without
>which the \b is interpreted as a backspace character in the string.
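A small check of the raw-string point, assuming standard Python string-escape rules: in a non-raw literal, `\b` turns into the backspace character (0x08), while `\w` is not a recognized escape and keeps its backslash. The variable names are only for illustration, and the non-raw version is written with `\\w` so it produces the same string without triggering an invalid-escape warning on newer Python versions.

```python
import re

# Non-raw literal: "\b" becomes a backspace character (0x08).
# "\\w" spells out the backslash that a bare "\w" would keep anyway.
non_raw = "(?u)\b\\w\\w+\b"

# Raw literal: every backslash reaches the regex engine untouched,
# so "\b" means "word boundary" as intended.
raw = r"(?u)\b\w\w+\b"

print("\x08" in non_raw)                     # True: contains a backspace
print(re.findall(non_raw, "for $2.50"))      # []: no backspaces in the text
print(re.findall(raw, "for $2.50"))          # ['for', '50']
```

So the documented default only works as written if the literal carries the `r` prefix.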
>
>Regards,
>-Tom
>
>
>_______________________________________________
>Scikit-learn-general mailing list
>Scikit-learn-general@lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/scikit-learn-general