2013/2/25 Vlad Niculae <zephy...@gmail.com>:
> This is certainly a case where the default behaviour cannot possibly
> please everybody. I can't think of an application where changing
> tokenization and preprocessing wouldn't help.
>
> For instance you often want to replace all numbers with the same
> token. Possibly you want a different token for numbers and for
> currency.

Hear hear.

This is one of the great ironies of NLP: intuitively, tokenizing is
the first part of the pipeline, but in practice, doing it perfectly is
AI-complete so you have to resort to domain-specific heuristics.
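
As a rough illustration of such a heuristic, here is a minimal sketch of
the number/currency replacement suggested in the quoted text. The
placeholder tokens, the currency symbols covered and the function name
normalize_numbers are all assumptions made up for this example; none of
this is scikit-learn code.

```python
import re

# Hypothetical patterns: a currency symbol followed by a number,
# and a bare number with optional "." or "," groups.
CURRENCY_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d+)*")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_numbers(text):
    """Replace currency amounts first, then any remaining numbers."""
    text = CURRENCY_RE.sub("CURRENCYTOKEN", text)
    return NUMBER_RE.sub("NUMBERTOKEN", text)


print(normalize_numbers("Paid $12.50 for 3 items in 2013."))
# Paid CURRENCYTOKEN for NUMBERTOKEN items in NUMBERTOKEN.
```

Currency has to be handled before plain numbers, otherwise the amount
after the symbol would already have been replaced. A function like this
would run on the raw text before tokenization.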

> I think that a regexp covering too much would actually make it more
> difficult for the user to change it, and indeed to realize that it
> needs to be changed.
>
> This being said, maybe something like '(?u)\b\S\w+\b' to allow tokens
> that start with symbols? But all of a sudden with this little change,
> I find the regexp's intent harder to see.

I'd be interested in a better RE, but it would have to be short and
sweet, and not use any fancy non-regular extensions such as
backreferences that can cause exponential-time matching (opening up a
can of DoS worms if someone uses scikit-learn in a web app; see
http://swtch.com/~rsc/regexp/regexp1.html).

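The obvious ur'(?u)\b[\$\w]{2,}\b' doesn't work: \b only matches between
a word character and a non-word character, and $ is a non-word
character, so there is no boundary between a space and a following $.
The match can only start after the $. This can get hairy real quick.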
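
The \b behaviour is easy to check with the standard re module. The
sketch below uses Python 3, so the ur prefix becomes a plain r. The
sample text and the second pattern are assumptions for illustration
only, not a proposal for a new default.

```python
import re

text = "costs $100 or 20 euros"

# The pattern discussed above. A token cannot start at "$" because
# " $" is two non-word characters, so \b does not match between them.
naive = re.compile(r"(?u)\b[\$\w]{2,}\b")
print(naive.findall(text))
# ['costs', '100', 'or', '20', 'euros']

# One possible workaround: treat "$" as an optional prefix that sits
# outside the boundary check.
prefixed = re.compile(r"(?u)\$?\b\w{2,}\b")
print(prefixed.findall(text))
# ['costs', '$100', 'or', '20', 'euros']
```

The second pattern keeps the $ but is already harder to read, which is
the trade-off mentioned in the quoted text.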

(I don't see the bug in the docs...)

-- 
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
