On Mar 3, 2011, at 1:10 PM, Robert Muir wrote:

> On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> Unfortunately, I'm not following your reasons for doing it. I won't say I'm
>> against it at this point, but I don't see a compelling reason to change it
>> either, so if you could clarify that would be great. It's been around for
>> quite some time in its current form and I think fits most people's
>> expectations of ngrams.
>
> Grant, I'm sorry, but I couldn't disagree more.
>
> There are many variations on n-gram tokenization (word-internal,
> word-spanning, skipgrams), besides allowing flexibility for what
> should be a "word character" and what should not be (e.g.
> punctuation), and how to handle the specifics of these.
>
> But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:
> 1. it discards anything after the first 1024 code units of the document.
> 2. it uses partial characters (UTF-16 code units) as its fundamental
> measure, potentially creating lots of invalid unicode.
> 3. it forms n-grams in the wrong order, contributing to #1. I
> explained this in LUCENE-1224.
Sure, but those are ancillary to the whitespace question that was asked about.

> It's for these reasons that I suggested we completely rewrite it... people
> that are just indexing English documents with < 1024 chars per
> document and don't care about these things can use
> ClassicNGramTokenizer.

Fair enough. Always open to improvements.
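To make point 2 above concrete, here is a minimal, self-contained Java sketch (not the actual NGramTokenizer code; the class and method names are invented for illustration) of the difference between forming bigrams over UTF-16 code units and over code points:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: slicing a String by UTF-16 code units can split a
// surrogate pair and emit invalid unicode, while stepping by code points
// keeps each character intact.
public class NGramSketch {

  // Bigrams measured in code units: any supplementary character (one code
  // point, two code units) gets cut in half at its surrogate boundary.
  static List<String> bigramsByCodeUnit(String s) {
    List<String> grams = new ArrayList<String>();
    for (int i = 0; i + 2 <= s.length(); i++) {
      grams.add(s.substring(i, i + 2));
    }
    return grams;
  }

  // Bigrams measured in code points: advance by Character.charCount so a
  // surrogate pair always stays together inside one gram.
  static List<String> bigramsByCodePoint(String s) {
    List<String> grams = new ArrayList<String>();
    int start = 0;
    while (start < s.length()) {
      int mid = start + Character.charCount(s.codePointAt(start));
      if (mid >= s.length()) {
        break;
      }
      int end = mid + Character.charCount(s.codePointAt(mid));
      grams.add(s.substring(start, end));
      start = mid;
    }
    return grams;
  }

  public static void main(String[] args) {
    // 'a', U+20000 (a supplementary CJK character), 'b': 3 code points, 4 code units.
    String text = "a\uD840\uDC00b";
    // Code-unit bigrams: [a\uD840, \uD840\uDC00, \uDC00b] -- the first and
    // last contain unpaired surrogates, i.e. invalid unicode.
    System.out.println(bigramsByCodeUnit(text));
    // Code-point bigrams: [a\uD840\uDC00, \uD840\uDC00b] -- all valid strings.
    System.out.println(bigramsByCodePoint(text));
  }
}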
