On May 17, 2006, at 2:04 PM, Doug Cutting wrote:
>> Detecting invalidly encoded text later doesn't help anything in
>> and of itself; lifting the requirement that everything be
>> converted to Unicode early on opens up some options.
>
> How useful are those options? Are they worth the price?
> Converting to unicode early permits one to, e.g., write encoding-
> independent tokenizers, stemmers, etc. That seems like a lot to
> throw away.
Fair enough. For Java Lucene, the main benefits of encoding
flexibility would accrue when A) your material takes up a lot more
space in UTF-8 than it does in some alternative encoding, or B) you
prefer a native encoding to Unicode, most often because of the Han
unification controversy.
The space issue could be addressed by allowing UTF-16 as an
alternative. Catering to arbitrary encodings doesn't offer that much
benefit for the price, though your perspective on that may differ if
you're, say, Japanese.
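To put a rough number on the space issue -- this is just a throwaway
illustration, class name made up, nothing from the patch -- each Han
character costs three bytes in UTF-8 but only two in UTF-16:

    import java.io.UnsupportedEncodingException;

    // Throwaway illustration: byte counts for a short CJK string.
    public class EncodingSize {
        public static void main(String[] args) throws UnsupportedEncodingException {
            String han = "\u65E5\u672C\u8A9E";  // "Japanese" written in Han characters
            System.out.println(han.getBytes("UTF-8").length);    // 9
            System.out.println(han.getBytes("UTF-16BE").length); // 6 (no BOM with UTF-16BE)
        }
    }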
>>> UTF-8 has the property that bytewise lexicographic order is the
>>> same as Unicode character order.
>>
>> Yes. I'm suggesting that an unpatched TermBuffer would have
>> problems with my index with corrupt character data because the
>> sort order by bytestring may not be the same as sort order by
>> Unicode code point.
>
> I think you're saying that bytewise comparisons involving invalid
> UTF-8 may differ from comparisons of the unicode code points they
> represent. But if they're invalid, they don't actually represent
> unicode code points, so how can they be compared?
Repairing an invalid Unicode sequence, whether it's UTF-8, UTF-16BE,
or some other form, generally means swapping in U+FFFD "REPLACEMENT
CHARACTER", provided that you don't throw a fatal error. U+FFFD has
its own numeric value, which affects sort order, and the swap may
also change the length of the sequence.
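Here's a quick sketch of that repair behavior using the stock
java.nio decoder -- not the patch's code, just a demonstration that
REPLACE semantics swap in U+FFFD and change the byte length:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;

    // Demonstration only: decode an invalid UTF-8 sequence with REPLACE
    // semantics and watch U+FFFD get swapped in.
    public class ReplacementDemo {
        public static void main(String[] args) throws Exception {
            byte[] bad = { (byte) 'a', (byte) 0x80, (byte) 'b' }; // 0x80 is a stray continuation byte
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
            CharBuffer repaired = decoder.decode(ByteBuffer.wrap(bad));
            System.out.println((int) repaired.charAt(1));                      // 65533 == U+FFFD
            System.out.println(repaired.toString().getBytes("UTF-8").length);  // 5 bytes, up from 3
        }
    }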
More generally, if you map from valid data in another encoding to
Unicode, lexical sorting of the source bytestring and lexical sorting
of the Unicode target will often produce differing results.
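A concrete toy case, nothing to do with Lucene's code: in
windows-1252, the byte 0x80 decodes to U+20AC (the Euro sign) while
0xA0 decodes to U+00A0, so the byte order and the code point order
point in opposite directions:

    // Toy example of bytewise vs. Unicode sort disagreement for a legacy encoding.
    public class SortSkew {
        public static void main(String[] args) throws Exception {
            String lower  = new String(new byte[] { (byte) 0x80 }, "windows-1252"); // "\u20AC"
            String higher = new String(new byte[] { (byte) 0xA0 }, "windows-1252"); // "\u00A0"
            System.out.println(0x80 < 0xA0);                  // true: bytewise, 0x80 sorts first
            System.out.println(lower.compareTo(higher) < 0);  // false: as Unicode, U+20AC sorts last
        }
    }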
It really messes up a TermInfosReader to have terms out of sequence.
And unfortunately, I misremembered how the cached Terms from the
auxiliary term dictionary get compared -- those use
term.compareTo(otherTerm) rather than
termBuffer.compareTo(otherTermBuffer). The patched version of Lucene
doesn't change that, so if an invalidly encoded term with a
replacement character happens to fall on an index point, the two
comparisons can disagree and term lookups will go astray.
That means the current patch is inadequate for dealing with
KinoSearch 0.05 or Ferret indexes unless the application developer
forced UTF-8 at index-time. I'd need to make additional changes in
order to guarantee that a patched Luke would work -- TermInfosReader
would need to cache the bytestrings and compare those instead.
That's effectively what KinoSearch does.
That's probably a good idea anyway, as it cuts down the RAM
requirements for caching the Term Infos Index -- so long as your data
occupies less space as a bytestring than as Java chars.
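For what it's worth, the comparison I have in mind is just an
unsigned bytewise one -- something along these lines (a sketch only,
not the actual TermInfosReader change):

    // Sketch: lexicographic comparison of raw term bytes, treated as unsigned,
    // which matches UTF-8 code point order for validly encoded terms.
    static int compareTermBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;  // bytes are signed in Java; mask to compare as unsigned
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }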
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/