On 24/11/2011 22:41, Marvin Humphrey wrote:
> On Wed, Nov 23, 2011 at 10:53:54PM +0100, Nick Wellnhofer wrote:
>> On 23/11/11 03:50, Marvin Humphrey wrote:
>>> How about making this tokenizer implement the word break rules described
>>> in the Unicode standard annex on Text Segmentation?  That's what the
>>> Lucene StandardTokenizer does (as of 3.1).
>>
>> That would certainly be a nice choice for the default tokenizer. It
>> would be easy to implement with ICU but utf8proc doesn't buy us much
>> here.
>
> Hmm, that's unfortunate.  I think this would be a very nice feature to offer.

I had a closer look at the word boundary rules in UAX #29, and they shouldn't be too hard to implement without an external library. I've started on an initial prototype, and it looks very promising.
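
For illustration, here is a minimal sketch of the rule-driven approach (the WB_* names and the function signatures are placeholders, not actual prototype code). Each code point is classified by its Word_Break property, and the rules decide whether a boundary falls between two adjacent characters:

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical sketch, not the actual prototype.
     * A few of the UAX #29 Word_Break property values. */
    typedef enum {
        WB_OTHER, WB_CR, WB_LF, WB_NEWLINE,
        WB_ALETTER, WB_NUMERIC, WB_MIDLETTER, WB_MIDNUM
    } WordBreakProp;

    /* Table-driven property lookup, sketched further below. */
    WordBreakProp wb_lookup(uint32_t code_point);

    /* Decide whether a word boundary falls between two adjacent
     * code points.  Only the context-free rules are shown; rules
     * like WB6/WB7 (don't break letters across MidLetter) need
     * lookahead and lookbehind as well. */
    static bool
    is_word_boundary(WordBreakProp prev, WordBreakProp next) {
        if (prev == WB_CR && next == WB_LF)  return false; /* WB3  */
        if (prev == WB_CR || prev == WB_LF || prev == WB_NEWLINE)
            return true;                                   /* WB3a */
        if (next == WB_CR || next == WB_LF || next == WB_NEWLINE)
            return true;                                   /* WB3b */
        if (prev == WB_ALETTER && next == WB_ALETTER)
            return false;                                  /* WB5  */
        if (prev == WB_NUMERIC && next == WB_NUMERIC)
            return false;                                  /* WB8  */
        if (prev == WB_ALETTER && next == WB_NUMERIC)
            return false;                                  /* WB9  */
        if (prev == WB_NUMERIC && next == WB_ALETTER)
            return false;                                  /* WB10 */
        return true;                                       /* WB14 */
    }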

In order to look up the Word_Break property values, we have to precompute a few tables. I would write a Perl script for that. The tables can be generated once and shipped with the source code, much like the tables for utf8proc. I'm not sure where to put that script and the generated tables, though.
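
To give an idea of the shape of the generated tables, here's a sketch of a common two-stage layout, continuing the example above (the names and the 256-entry block size are assumptions, not necessarily what the script would emit):

    #include <stdint.h>

    /* Hypothetical layout, generated from the UCD's
     * WordBreakProperty.txt: the first table maps
     * (code point >> 8) to a block, the second stores one
     * property value per code point within that block.
     * Identical 256-entry blocks are shared, which keeps
     * the total size small. */
    extern const uint16_t wb_block_index[0x110000 >> 8];
    extern const uint8_t  wb_block_data[][256];

    WordBreakProp
    wb_lookup(uint32_t cp) {
        if (cp > 0x10FFFF) return WB_OTHER;
        return (WordBreakProp)
            wb_block_data[wb_block_index[cp >> 8]][cp & 0xFF];
    }

The Perl script would just parse WordBreakProperty.txt, build the blocks, deduplicate them, and dump the two arrays as C source.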

Nick
