On 24/11/2011 22:41, Marvin Humphrey wrote:
> On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
>> On 23/11/11 03:50, Marvin Humphrey wrote:
>>> How about making this tokenizer implement the word break rules described
>>> in the Unicode standard annex on Text Segmentation?  That's what the
>>> Lucene StandardTokenizer does (as of 3.1).
>>
>> That would certainly be a nice choice for the default tokenizer. It
>> would be easy to implement with ICU but utf8proc doesn't buy us much
>> here.
>
> Hmm, that's unfortunate.  I think this would be a very nice feature to offer.

I had a closer look at the word boundary rules in UAX #29, and they shouldn't be too hard to implement without an external library. I've started on an initial prototype, and it looks very promising.
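
For illustration, here is a minimal sketch of the rule-driven approach (the WB_* names and the function signatures are placeholders, not actual prototype code). Each code point is classified by its Word_Break property, and the rules decide whether a boundary falls between two adjacent characters:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical sketch, not the actual prototype.
     * A few of the UAX #29 Word_Break property values. */
    typedef enum {
        WB_OTHER, WB_CR, WB_LF, WB_NEWLINE,
        WB_ALETTER, WB_NUMERIC, WB_MIDLETTER, WB_MIDNUM
    } WordBreakProp;

    /* Table-driven property lookup, sketched further below. */
    WordBreakProp wb_lookup(uint32_t code_point);

    /* Decide whether a word boundary falls between two adjacent
     * code points.  Only the context-free rules are shown; rules
     * like WB6/WB7 (don't break letters across MidLetter) need
     * lookahead and lookbehind as well. */
    static bool
    is_word_boundary(WordBreakProp prev, WordBreakProp next) {
        if (prev == WB_CR && next == WB_LF)  return false; /* WB3  */
        if (prev == WB_CR || prev == WB_LF || prev == WB_NEWLINE)
            return true;                                   /* WB3a */
        if (next == WB_CR || next == WB_LF || next == WB_NEWLINE)
            return true;                                   /* WB3b */
        if (prev == WB_ALETTER && next == WB_ALETTER)
            return false;                                  /* WB5  */
        if (prev == WB_NUMERIC && next == WB_NUMERIC)
            return false;                                  /* WB8  */
        if (prev == WB_ALETTER && next == WB_NUMERIC)
            return false;                                  /* WB9  */
        if (prev == WB_NUMERIC && next == WB_ALETTER)
            return false;                                  /* WB10 */
        return true;                                       /* WB14 */
    }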

In order to look up the Word_Break property values, we have to precompute a few tables. I would write a Perl script for that. The tables can be generated once and shipped with the source code, much like the tables for utf8proc. I'm not sure where to put that script and the generated tables, though.
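
To give an idea of the shape of the generated tables, here's a sketch of a common two-stage layout, continuing the example above (the names and the 256-entry block size are assumptions, not necessarily what the script would emit):

    #include <stdint.h>

    /* Hypothetical layout, generated from the UCD's
     * WordBreakProperty.txt: the first table maps
     * (code point >> 8) to a block, the second stores one
     * property value per code point within that block.
     * Identical 256-entry blocks are shared, which keeps
     * the total size small. */
    extern const uint16_t wb_block_index[0x110000 >> 8];
    extern const uint8_t  wb_block_data[][256];

    WordBreakProp
    wb_lookup(uint32_t cp) {
        if (cp > 0x10FFFF) return WB_OTHER;
        return (WordBreakProp)
            wb_block_data[wb_block_index[cp >> 8]][cp & 0xFF];
    }

The Perl script would just parse WordBreakProperty.txt, build the blocks, deduplicate them, and dump the two arrays as C source.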

Nick
