Currently, Lucy only provides the RegexTokenizer, which is implemented on top of the Perl regex engine. With the help of utf8proc, we could implement a simple but more efficient tokenizer in core, without external dependencies.

Most importantly, we'd have to implement something similar to the \w regex character class. The Unicode standard [1,2] recommends that \w be equivalent to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode categories Letter, Mark, Decimal_Number, Letter_Number, and Connector_Punctuation, plus the circled letters. That's exactly how Perl implements \w. Other implementations like .NET's seem to differ slightly [3]. So we could look up Unicode categories with utf8proc, and a Perl-compatible check for a word character would then be as simple as ((cat >= 1 && cat <= 10) || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)) -- the lower bound keeps out unassigned code points, which utf8proc reports as category 0.
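For illustration, here's a minimal sketch of that check. It assumes utf8proc's category numbering (Lu = 1 through Nl = 10, Pc = 12, 0 = unassigned) and a reasonably recent utf8proc; the function name is made up, following our S_ convention for statics:

    #include <stdbool.h>
    #include <utf8proc.h>

    /* Perl-compatible \w test.  Categories Lu..Nl cover Letter,
     * Mark, Decimal_Number and Letter_Number; Pc is
     * Connector_Punctuation; U+24B6..U+24E9 are the circled
     * letters. */
    static bool
    S_is_word_char(utf8proc_int32_t c) {
        int cat = utf8proc_get_property(c)->category;
        return (cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_NL)
            || cat == UTF8PROC_CATEGORY_PC
            || (c >= 0x24B6 && c <= 0x24E9);
    }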

The default regex in RegexTokenizer also handles apostrophes, which I personally don't find very useful. But this could also be implemented in the core tokenizer.
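As a rough sketch (not actual Lucy code; the emit callback and names are invented), a core tokenizer loop could scan for runs of word characters with utf8proc_iterate() and consume an apostrophe only when another word character follows, mirroring a pattern like \w+(?:'\w+)*:

    /* Emit (start, end) byte offsets of each token in a UTF-8
     * buffer.  Invalid byte sequences are skipped. */
    static void
    S_tokenize(const utf8proc_uint8_t *buf, size_t len,
               void (*emit)(size_t start, size_t end)) {
        size_t pos = 0;
        while (pos < len) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t n
                = utf8proc_iterate(buf + pos, len - pos, &cp);
            if (n <= 0)              { pos += 1; continue; }
            if (!S_is_word_char(cp)) { pos += n; continue; }
            size_t start = pos;
            pos += n;
            for (;;) {
                n = utf8proc_iterate(buf + pos, len - pos, &cp);
                if (n > 0 && S_is_word_char(cp)) {
                    pos += n;
                }
                else if (n > 0 && cp == '\'') {
                    /* Take the apostrophe only if a word char
                     * follows, so "can't" stays one token but a
                     * trailing quote isn't swallowed. */
                    utf8proc_int32_t next;
                    utf8proc_ssize_t m = utf8proc_iterate(
                        buf + pos + n, len - pos - n, &next);
                    if (m <= 0 || !S_is_word_char(next)) { break; }
                    pos += n + m;
                }
                else {
                    break;
                }
            }
            emit(start, pos);
        }
    }

Making the apostrophe handling optional would just mean putting a flag around that branch.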

I'm wondering what other kinds of regexes people are using with RegexTokenizer, and whether this simple core tokenizer should be customizable to cover some of those use cases.

Nick

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter
