On May 17, 2006, at 2:04 PM, Doug Cutting wrote:

Detecting invalidly encoded text later doesn't help anything in and of itself; lifting the requirement that everything be converted to Unicode early on opens up some options.

How useful are those options? Are they worth the price? Converting to Unicode early permits one to, e.g., write encoding-independent tokenizers, stemmers, etc. That seems like a lot to throw away.

Fair enough. For Java Lucene, the main benefits of encoding flexibility would accrue when A) your material takes up a lot more space in UTF-8 than in an alternative encoding, or B) you prefer a native encoding to Unicode, most often because of the Han unification controversy.

The space issue could be addressed by allowing UTF-16 as an alternative. Catering to arbitrary encodings doesn't offer that much benefit for the price, though your perspective on that may differ if you're, say, Japanese.
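For a rough sense of the space trade-off, here is a small sketch (not from this thread; the sample strings are arbitrary) comparing how much room the same text occupies in UTF-8 versus UTF-16BE:

    import java.nio.charset.Charset;

    public class EncodedSizeDemo {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");
            Charset utf16 = Charset.forName("UTF-16BE");

            String ascii = "hello world";                                  // 11 ASCII chars
            String cjk = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8"; // 8 kanji/kana chars

            // ASCII: 11 bytes in UTF-8, 22 in UTF-16BE.
            // CJK:   24 bytes in UTF-8, 16 in UTF-16BE.
            System.out.println(ascii.getBytes(utf8).length + " / " + ascii.getBytes(utf16).length);
            System.out.println(cjk.getBytes(utf8).length + " / " + cjk.getBytes(utf16).length);
        }
    }

Kana and Han characters need three bytes apiece in UTF-8 but only two in UTF-16, which is exactly the scenario where option A above bites.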

UTF-8 has the property that bytewise lexicographic order is the same as Unicode character order.

Yes. I'm suggesting that an unpatched TermBuffer would have problems with my index containing corrupt character data, because sort order by bytestring may not be the same as sort order by Unicode code point.

I think you're saying that bytewise comparisons involving invalid UTF-8 may differ from comparisons of the Unicode code points they represent. But if they're invalid, they don't actually represent Unicode code points, so how can they be compared?

Repairing an invalid Unicode sequence, whether UTF-8, UTF-16BE, or another encoding form, generally means swapping in U+FFFD "REPLACEMENT CHARACTER", provided that you don't throw a fatal error instead. U+FFFD has its own numeric value, which affects sort order, and the swap may also change the length of the sequence.
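As a concrete illustration (a minimal sketch using Java's built-in CharsetDecoder with CodingErrorAction.REPLACE, not the patch code itself), here is the substitution and the length change it causes:

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    public class ReplacementCharDemo {
        public static void main(String[] args) throws CharacterCodingException {
            Charset utf8 = Charset.forName("UTF-8");

            // Invalid UTF-8: a lone continuation byte (0x80) in the middle.
            byte[] corrupt = { 'a', 'b', (byte) 0x80, 'c' };

            // Repair instead of throwing: malformed input becomes U+FFFD.
            CharsetDecoder decoder = utf8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            String repaired = decoder.decode(ByteBuffer.wrap(corrupt)).toString();

            System.out.println(repaired);                        // ab\uFFFDc
            System.out.println((int) repaired.charAt(2));        // 65533 -- sorts above all ASCII
            System.out.println(repaired.getBytes(utf8).length);  // 6 bytes, vs. 4 originally
        }
    }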

More generally, if you map valid data from another encoding to Unicode, lexicographic sorting of the source bytestrings and lexicographic sorting of the Unicode targets will often produce different results.
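For instance (a hypothetical example, assuming windows-1252 as the legacy encoding), byte 0x80 maps to U+20AC and byte 0xA3 maps to U+00A3, so the two orderings disagree:

    import java.nio.charset.Charset;

    public class ByteOrderVsUnicodeOrder {
        public static void main(String[] args) {
            Charset cp1252 = Charset.forName("windows-1252");

            // In windows-1252, byte 0x80 decodes to U+20AC (euro sign)
            // and byte 0xA3 decodes to U+00A3 (pound sign).
            String euro  = new String(new byte[] { (byte) 0x80 }, cp1252);
            String pound = new String(new byte[] { (byte) 0xA3 }, cp1252);

            // Bytewise, 0x80 sorts before 0xA3 ...
            System.out.println(Integer.compare(0x80, 0xA3));  // negative
            // ... but after conversion, U+20AC sorts after U+00A3.
            System.out.println(euro.compareTo(pound));        // positive
        }
    }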

It really messes up a TermInfosReader to have terms out of sequence. And unfortunately, I misremembered how the cached Terms from the auxiliary term dictionary get compared -- those use term.compareTo(otherTerm) rather than termBuffer.compareTo(otherTermBuffer). The patched version of Lucene doesn't change that, so if an invalidly encoded term with a replacement character happens to fall on an index point, bad things will happen.
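To see why an out-of-sequence term at an index point is fatal, consider this toy illustration (not Lucene's actual code): a binary search that assumes one sort order over entries stored in another can report a present entry as missing.

    import java.util.Arrays;

    public class OutOfOrderSeek {
        public static void main(String[] args) {
            // Toy data only: the entries are laid out in bytewise order
            // (0x61, 0x80 -> U+20AC, 0xA3 -> U+00A3), but the lookup
            // compares them as Unicode strings.
            String[] termsInByteOrder = { "a", "\u20AC", "\u00A3" };

            // Binary search assumes String.compareTo order, which this array
            // violates, so it misses an entry that is actually present.
            System.out.println(Arrays.binarySearch(termsInByteOrder, "\u00A3")); // negative: "not found"
        }
    }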

That means the current patch is inadequate for dealing with KinoSearch 0.05 or Ferret indexes unless the application developer forced UTF-8 at index time. I'd need to make additional changes to guarantee that a patched Luke would work -- TermInfosReader would need to cache the bytestrings and compare those instead. That's effectively what KinoSearch does.

That's probably a good idea anyway, as it cuts down the RAM requirements for caching the Term Infos Index -- so long as your data occupies less space as a bytestring than as Java chars.
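As a rough back-of-the-envelope check (hypothetical terms, ignoring object overhead), UTF-8 bytestrings win for mostly-ASCII terms and lose for CJK-heavy ones:

    import java.nio.charset.Charset;

    public class TermCacheFootprint {
        public static void main(String[] args) {
            Charset utf8 = Charset.forName("UTF-8");

            // ASCII term: 11 bytes as UTF-8 vs. 22 bytes as 2-byte Java chars.
            // CJK term:    6 bytes as UTF-8 vs.  4 bytes as 2-byte Java chars.
            String[] terms = { "information", "\u60C5\u5831" };
            for (String term : terms) {
                System.out.println(term.getBytes(utf8).length + " bytes as UTF-8, "
                        + term.length() * 2 + " bytes as chars");
            }
        }
    }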

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

