On May 17, 2006, at 2:04 PM, Doug Cutting wrote:

Detecting invalidly encoded text later doesn't help anything in and of itself; lifting the requirement that everything be converted to Unicode early on opens up some options.

How useful are those options? Are they worth the price? Converting to Unicode early permits one to, e.g., write encoding-independent tokenizers, stemmers, etc. That seems like a lot to throw away.

Fair enough. For Java Lucene, the main benefits of encoding flexibility would accrue when A) your material takes up a lot more space in UTF-8 than in an alternative encoding, or B) you prefer a native encoding to Unicode, most often because of the Han unification controversy.

The space issue could be addressed by allowing UTF-16 as an alternative. Catering to arbitrary encodings doesn't offer that much benefit for the price, though your perspective on that may differ if you're, say, Japanese.
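For a rough sense of the space trade-off, here is a small sketch (not from this thread; the sample strings are arbitrary) comparing how much room the same text occupies in UTF-8 versus UTF-16BE:

    import java.nio.charset.Charset;

    public class EncodedSizeDemo {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");
            Charset utf16 = Charset.forName("UTF-16BE");

            String ascii = "hello world";                                  // 11 ASCII chars
            String cjk = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8"; // 8 kanji/kana chars

            // ASCII: 11 bytes in UTF-8, 22 in UTF-16BE.
            // CJK:   24 bytes in UTF-8, 16 in UTF-16BE.
            System.out.println(ascii.getBytes(utf8).length + " / " + ascii.getBytes(utf16).length);
            System.out.println(cjk.getBytes(utf8).length + " / " + cjk.getBytes(utf16).length);
        }
    }

Kana and Han characters need three bytes apiece in UTF-8 but only two in UTF-16, which is exactly the scenario where option A above bites.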

UTF-8 has the property that bytewise lexicographic order is the same as Unicode character order.

Yes. I'm suggesting that an unpatched TermBuffer would have problems with my index containing corrupt character data, because sort order by bytestring may not be the same as sort order by Unicode code point.

I think you're saying that bytewise comparisons involving invalid UTF-8 may differ from comparisons of the Unicode code points they represent. But if they're invalid, they don't actually represent Unicode code points, so how can they be compared?

Repairing an invalid Unicode sequence, whether UTF-8, UTF-16BE, or another encoding form, generally means swapping in U+FFFD "REPLACEMENT CHARACTER", provided that you don't throw a fatal error instead. U+FFFD has its own numeric value, which affects sort order, and the swap may also change the length of the sequence.
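As a concrete illustration (a minimal sketch using Java's built-in CharsetDecoder with CodingErrorAction.REPLACE, not the patch code itself), here is the substitution and the length change it causes:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class ReplacementCharDemo {
        public static void main(String[] args) throws CharacterCodingException {
            Charset utf8 = Charset.forName("UTF-8");

            // Invalid UTF-8: a lone continuation byte (0x80) in the middle.
            byte[] corrupt = { 'a', 'b', (byte) 0x80, 'c' };

            // Repair instead of throwing: malformed input becomes U+FFFD.
            CharsetDecoder decoder = utf8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            String repaired = decoder.decode(ByteBuffer.wrap(corrupt)).toString();

            System.out.println(repaired);                        // ab\uFFFDc
            System.out.println((int) repaired.charAt(2));        // 65533 -- sorts above all ASCII
            System.out.println(repaired.getBytes(utf8).length);  // 6 bytes, vs. 4 originally
        }
    }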

More generally, if you map valid data from another encoding to Unicode, lexicographic sorting of the source bytestrings and lexicographic sorting of the Unicode targets will often produce different results.
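For instance (a hypothetical example, assuming windows-1252 as the legacy encoding), byte 0x80 maps to U+20AC and byte 0xA3 maps to U+00A3, so the two orderings disagree:

    import java.nio.charset.Charset;

    public class ByteOrderVsUnicodeOrder {
        public static void main(String[] args) {
            Charset cp1252 = Charset.forName("windows-1252");

            // In windows-1252, byte 0x80 decodes to U+20AC (euro sign)
            // and byte 0xA3 decodes to U+00A3 (pound sign).
            String euro  = new String(new byte[] { (byte) 0x80 }, cp1252);
            String pound = new String(new byte[] { (byte) 0xA3 }, cp1252);

            // Bytewise, 0x80 sorts before 0xA3 ...
            System.out.println(Integer.compare(0x80, 0xA3));  // negative
            // ... but after conversion, U+20AC sorts after U+00A3.
            System.out.println(euro.compareTo(pound));        // positive
        }
    }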

It really messes up a TermInfosReader to have terms out of sequence. And unfortunately, I misremembered how the cached Terms from the auxiliary term dictionary get compared -- those use term.compareTo(otherTerm) rather than termBuffer.compareTo(otherTermBuffer). The patched version of Lucene doesn't change that, so if an invalidly encoded term with a replacement character happens to fall on an index point, bad things will happen.
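To see why an out-of-sequence term at an index point is fatal, consider this toy illustration (not Lucene's actual code): a binary search that assumes one sort order over entries stored in another can report a present entry as missing.

    import java.util.Arrays;

    public class OutOfOrderSeek {
        public static void main(String[] args) {
            // Toy data only: the entries are laid out in bytewise order
            // (0x61, 0x80 -> U+20AC, 0xA3 -> U+00A3), but the lookup
            // compares them as Unicode strings.
            String[] termsInByteOrder = { "a", "\u20AC", "\u00A3" };

            // Binary search assumes String.compareTo order, which this array
            // violates, so it misses an entry that is actually present.
            System.out.println(Arrays.binarySearch(termsInByteOrder, "\u00A3")); // negative: "not found"
        }
    }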

That means the current patch is inadequate for dealing with KinoSearch 0.05 or Ferret indexes unless the application developer forced UTF-8 at index time. I'd need to make additional changes to guarantee that a patched Luke would work -- TermInfosReader would need to cache the bytestrings and compare those instead. That's effectively what KinoSearch does.

That's probably a good idea anyway, as it cuts down the RAM requirements for caching the Term Infos Index -- so long as your data occupies less space as a bytestring than as Java chars.
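As a rough back-of-the-envelope check (hypothetical terms, ignoring object overhead), UTF-8 bytestrings win for mostly-ASCII terms and lose for CJK-heavy ones:

    import java.nio.charset.Charset;

    public class TermCacheFootprint {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");

            // ASCII term: 11 bytes as UTF-8 vs. 22 bytes as 2-byte Java chars.
            // CJK term:    6 bytes as UTF-8 vs.  4 bytes as 2-byte Java chars.
            String[] terms = { "information", "\u60C5\u5831" };
            for (String term : terms) {
                System.out.println(term.getBytes(utf8).length + " bytes as UTF-8, "
                        + term.length() * 2 + " bytes as chars");
            }
        }
    }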

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

