2013/2/25 Vlad Niculae <zephy...@gmail.com>:
> This is certainly a case where the default behaviour cannot possibly
> please everybody. I can't think of an application where changing
> tokenization and preprocessing wouldn't help.
>
> For instance you often want to replace all numbers with the same
> token. Possibly you want a different token for numbers and for
> currency.

Hear hear. This is one of the great ironies of NLP: intuitively,
tokenizing is the first part of the pipeline, but in practice, doing
it perfectly is AI-complete, so you have to resort to domain-specific
heuristics.
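For what it's worth, Vlad's number-collapsing example is already doable
through the preprocessor hook. A rough, untested sketch (the placeholder
token names and the exact regexes are made up for illustration):

    import re
    from sklearn.feature_extraction.text import CountVectorizer

    def collapse_numbers(doc):
        # currency amounts first, then bare numbers; both become
        # placeholder tokens that the default token_pattern keeps
        doc = re.sub(r'\$\d+(?:\.\d+)?', ' currencytoken ', doc)
        doc = re.sub(r'\b\d+(?:\.\d+)?\b', ' numbertoken ', doc)
        return doc.lower()

    vect = CountVectorizer(preprocessor=collapse_numbers)
    X = vect.fit_transform(["Tickets cost $20, doors open at 19.30"])
    # the vocabulary now contains 'currencytoken' and 'numbertoken'
    # where the raw digits would otherwise have ended up

Note that passing preprocessor= replaces the default lowercasing and
accent stripping, hence the explicit lower() at the end.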
> I think that a regexp covering too much would actually make it more
> difficult for the user to change it, and indeed to realize that it
> needs to be changed.
>
> This being said, maybe something like (?u)\b\S\w+\b to allow tokens
> that start with symbols? But all of a sudden with this little change,
> I find the regexp's intent harder to see.

I'd be interested in a better RE, but it would have to be short and
sweet, and not use any fancy non-regular extensions such as
backreferences that can cause exponential-time matching (opening up a
can of DoS worms if someone uses scikit-learn in a web app; see
http://swtch.com/~rsc/regexp/regexp1.html).

The obvious ur'(?u)\b[\$\w]{2,}\b' doesn't work because \b eats up the
$ before [$\w] gets to see it. This can get hairy real quick.

(I don't see the bug in the docs...)

--
Lars Buitinck
Scientific programmer, ILPS
University of Amsterdam
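P.S. To make the \b problem concrete, a quick sketch (Python 2 syntax
to match the ur'' above; only eyeballed, not run):

    import re

    # the leading \b cannot match right before the "$": a space and a
    # "$" are both non-word characters, so there is no boundary there
    re.findall(ur'(?u)\b[\$\w]{2,}\b', u'it costs $20 today')
    # -> [u'it', u'costs', u'20', u'today']   (the $ is lost)

    # dropping the leading \b keeps it, at the price of an uglier RE
    re.findall(ur'(?u)[\$\w]{2,}\b', u'it costs $20 today')
    # -> [u'it', u'costs', u'$20', u'today']

In any case, token_pattern= is there for anyone who wants the uglier
variant in their own vectorizer without touching the default.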