On Mar 3, 2011, at 1:10 PM, Robert Muir wrote:

> On Thu, Mar 3, 2011 at 1:00 PM, Grant Ingersoll <[email protected]> wrote:
>>
>> Unfortunately, I'm not following your reasons for doing it. I won't say I'm
>> against it at this point, but I don't see a compelling reason to change it
>> either, so if you could clarify that would be great. It's been around for
>> quite some time in its current form and I think fits most people's
>> expectations of ngrams.
>
> Grant, I'm sorry, but I couldn't disagree more.
>
> There are many variations on n-gram tokenization (word-internal,
> word-spanning, skipgrams), besides allowing flexibility for what
> should be a "word character" and what should not be (e.g.
> punctuation), and how to handle the specifics of these.
>
> But our n-gram tokenizer is *UNARGUABLY* completely broken for these reasons:
> 1. it discards anything after the first 1024 code units of the document.
> 2. it uses partial characters (UTF-16 code units) as its fundamental
> measure, potentially creating lots of invalid unicode.
> 3. it forms n-grams in the wrong order, contributing to #1. I
> explained this in LUCENE-1224.
Sure, but those are ancillary to the whitespace question that was asked about.

> It's for these reasons that I suggested we completely rewrite it... people
> that are just indexing English documents with < 1024 chars per
> document and don't care about these things can use
> ClassicNGramTokenizer.

Fair enough. Always open to improvements.
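To make point 2 above concrete, here is a minimal, self-contained Java sketch (not the actual NGramTokenizer code; the class and method names are invented for illustration) of the difference between forming bigrams over UTF-16 code units and over code points:

import java.util.ArrayList;
import java.util.List;

// Minimal sketch: slicing a String by UTF-16 code units can split a
// surrogate pair and emit invalid unicode, while stepping by code points
// keeps each character intact.
public class NGramSketch {

  // Bigrams measured in code units: any supplementary character (one code
  // point, two code units) gets cut in half at its surrogate boundary.
  static List<String> bigramsByCodeUnit(String s) {
    List<String> grams = new ArrayList<String>();
    for (int i = 0; i + 2 <= s.length(); i++) {
      grams.add(s.substring(i, i + 2));
    }
    return grams;
  }

  // Bigrams measured in code points: advance by Character.charCount so a
  // surrogate pair always stays together inside one gram.
  static List<String> bigramsByCodePoint(String s) {
    List<String> grams = new ArrayList<String>();
    int start = 0;
    while (start < s.length()) {
      int mid = start + Character.charCount(s.codePointAt(start));
      if (mid >= s.length()) {
        break;
      }
      int end = mid + Character.charCount(s.codePointAt(mid));
      grams.add(s.substring(start, end));
      start = mid;
    }
    return grams;
  }

  public static void main(String[] args) {
    // 'a', U+20000 (a supplementary CJK character), 'b': 3 code points, 4 code units.
    String text = "a\uD840\uDC00b";
    // Code-unit bigrams: [a\uD840, \uD840\uDC00, \uDC00b] -- the first and
    // last contain unpaired surrogates, i.e. invalid unicode.
    System.out.println(bigramsByCodeUnit(text));
    // Code-point bigrams: [a\uD840\uDC00, \uD840\uDC00b] -- all valid strings.
    System.out.println(bigramsByCodePoint(text));
  }
}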
