Currently, Lucy only provides the RegexTokenizer, which is implemented on
top of the Perl regex engine. With the help of utf8proc, we could
implement a simple but more efficient tokenizer in core, without external
dependencies. Most importantly, we'd have to implement something
similar to the \w regex character class. The Unicode standard [1,2]
recommends that \w be equivalent to
[\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode
categories Letter, Mark, Decimal_Number, Letter_Number, and
Connector_Punctuation plus the circled letters. That's exactly how Perl
implements \w. Other implementations like .NET seem to differ slightly
[3]. So we could look up Unicode categories with utf8proc. Since
utf8proc numbers the categories Lu through Nl as 1 to 10 and Pc as 12, a
Perl-compatible check for a word character would be as easy as
(cat >= 1 && cat <= 10) || cat == 12 || (c >= 0x24B6 && c <= 0x24E9).
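For illustration, here's a minimal sketch of that check. is_word_char is
just a made-up name, and I'm assuming a utf8proc that provides
utf8proc_category(); utf8proc_get_property(c)->category would work the
same way:

    #include <stdbool.h>
    #include <utf8proc.h>

    /* Sketch: Perl-compatible \w test via utf8proc category lookup. */
    static bool
    is_word_char(utf8proc_int32_t c) {
        utf8proc_category_t cat = utf8proc_category(c);
        return (cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_NL)
                                             /* Lu..Nl: Letter, Mark, Nd, Nl */
            || cat == UTF8PROC_CATEGORY_PC   /* Connector_Punctuation */
            || (c >= 0x24B6 && c <= 0x24E9); /* circled letters */
    }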
The default regex in RegexTokenizer also handles apostrophes, which I
personally don't find very useful. But this could also be implemented in
the core tokenizer.
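To make the idea concrete, the scan loop of such a core tokenizer could
look roughly like the sketch below, using is_word_char() from above.
emit_token() is a placeholder for handing the token to the indexer, and
apostrophe handling would just be an extra condition in the inner test:

    /* Placeholder: hand a finished token to the caller/indexer. */
    static void
    emit_token(const utf8proc_uint8_t *tok, utf8proc_ssize_t len);

    /* Sketch: emit byte ranges of consecutive word characters. */
    static void
    tokenize(const utf8proc_uint8_t *text, utf8proc_ssize_t len) {
        utf8proc_ssize_t pos   = 0;
        utf8proc_ssize_t start = -1;  /* -1 means: not inside a token */

        while (pos < len) {
            utf8proc_int32_t c;
            utf8proc_ssize_t n = utf8proc_iterate(text + pos, len - pos, &c);
            if (n < 0) { break; }     /* invalid UTF-8 */

            if (is_word_char(c)) {
                if (start < 0) { start = pos; }
            }
            else if (start >= 0) {
                emit_token(text + start, pos - start);
                start = -1;
            }
            pos += n;
        }
        if (start >= 0) {
            emit_token(text + start, pos - start);
        }
    }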
I'm wondering what other kinds of regexes people are using with
RegexTokenizer, and whether this simple core tokenizer should be
customizable for some of those use cases.
Nick
[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter