Currently, Lucy only provides the RegexTokenizer, which is implemented on top of the Perl regex engine. With the help of utf8proc, we could implement a simple but more efficient tokenizer in core, without external dependencies.

Most importantly, we'd have to implement something similar to the \w regex character class. The Unicode standard [1,2] recommends that \w be equivalent to [\pL\pM\p{Nd}\p{Nl}\p{Pc}\x{24b6}-\x{24e9}], that is, the Unicode categories Letter, Mark, Decimal_Number, Letter_Number, and Connector_Punctuation, plus the circled letters. That's exactly how Perl implements \w. Other implementations like .NET's seem to differ slightly [3]. So we could look up Unicode categories with utf8proc, and a Perl-compatible check for a word character would then be as simple as ((cat >= 1 && cat <= 10) || cat == 12 || (c >= 0x24b6 && c <= 0x24e9)) -- the lower bound keeps out unassigned code points, which utf8proc reports as category 0.
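For illustration, here's a minimal sketch of that check. It assumes utf8proc's category numbering (Lu = 1 through Nl = 10, Pc = 12, 0 = unassigned) and a reasonably recent utf8proc; the function name is made up, following our S_ convention for statics:

    #include <stdbool.h>
    #include <utf8proc.h>

    /* Perl-compatible \w test.  Categories Lu..Nl cover Letter,
     * Mark, Decimal_Number and Letter_Number; Pc is
     * Connector_Punctuation; U+24B6..U+24E9 are the circled
     * letters. */
    static bool
    S_is_word_char(utf8proc_int32_t c) {
        int cat = utf8proc_get_property(c)->category;
        return (cat >= UTF8PROC_CATEGORY_LU && cat <= UTF8PROC_CATEGORY_NL)
            || cat == UTF8PROC_CATEGORY_PC
            || (c >= 0x24B6 && c <= 0x24E9);
    }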

The default regex in RegexTokenizer also handles apostrophes, which I personally don't find very useful. But this could also be implemented in the core tokenizer.
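As a rough sketch (not actual Lucy code; the emit callback and names are invented), a core tokenizer loop could scan for runs of word characters with utf8proc_iterate() and consume an apostrophe only when another word character follows, mirroring a pattern like \w+(?:'\w+)*:

    /* Emit (start, end) byte offsets of each token in a UTF-8
     * buffer.  Invalid byte sequences are skipped. */
    static void
    S_tokenize(const utf8proc_uint8_t *buf, size_t len,
               void (*emit)(size_t start, size_t end)) {
        size_t pos = 0;
        while (pos < len) {
            utf8proc_int32_t cp;
            utf8proc_ssize_t n
                = utf8proc_iterate(buf + pos, len - pos, &cp);
            if (n <= 0)              { pos += 1; continue; }
            if (!S_is_word_char(cp)) { pos += n; continue; }
            size_t start = pos;
            pos += n;
            for (;;) {
                n = utf8proc_iterate(buf + pos, len - pos, &cp);
                if (n > 0 && S_is_word_char(cp)) {
                    pos += n;
                }
                else if (n > 0 && cp == '\'') {
                    /* Take the apostrophe only if a word char
                     * follows, so "can't" stays one token but a
                     * trailing quote isn't swallowed. */
                    utf8proc_int32_t next;
                    utf8proc_ssize_t m = utf8proc_iterate(
                        buf + pos + n, len - pos - n, &next);
                    if (m <= 0 || !S_is_word_char(next)) { break; }
                    pos += n + m;
                }
                else {
                    break;
                }
            }
            emit(start, pos);
        }
    }

Making the apostrophe handling optional would just mean putting a flag around that branch.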

I'm wondering what other kinds of regexes people are using with RegexTokenizer, and whether this simple core tokenizer should be customizable to cover some of those use cases.

Nick

[1] http://www.unicode.org/reports/tr18/#Compatibility_Properties
[2] http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
[3] http://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter
