Marvin Humphrey wrote:
I *think* that whether it was invalidly encoded or not wouldn't impact
searching -- it doesn't in KinoSearch. It should only affect display.
I think Java's approach of converting everything to unicode internally
is useful. One must still handle dirty input, but it is easy to write
output that conforms to standards. I'd hate to lose that.
Java programs have a good reputation for supporting
internationalization, better than those written in languages that
primarily represent strings as byte arrays and rely on library utilities for
handling encodings and character sets. Java's choice of 16-bit
characters may have been an error, but the general approach of
converting all textual data to unicode internally has led to fewer
internationalization issues than are common in other systems.
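A minimal sketch of that approach in plain Java (not Lucene code; the class
name and the choice of lenient UTF-8 decoding are just assumptions for
illustration): decode possibly-dirty bytes once, at the boundary, so that
everything downstream sees clean Unicode and anything written back out is
guaranteed to be valid.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class DecodeAtTheBoundary {
        // Decode possibly-dirty bytes into a Java String (Unicode) up front.
        // Malformed sequences are replaced with U+FFFD rather than propagated.
        static String decodeLeniently(byte[] raw) {
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
            try {
                return decoder.decode(ByteBuffer.wrap(raw)).toString();
            } catch (CharacterCodingException e) {
                throw new AssertionError("unreachable with REPLACE", e);
            }
        }

        public static void main(String[] args) {
            byte[] dirty = { 'a', (byte) 0xFF, 'b' };  // 0xFF is never valid UTF-8
            String text = decodeLeniently(dirty);      // "a\uFFFDb"
            // From here on, everything downstream sees clean Unicode, and the
            // output we write (here, UTF-8) conforms to the standard.
            byte[] out = text.getBytes(StandardCharsets.UTF_8);
            System.out.println(text + " -> " + out.length + " bytes of valid UTF-8");
        }
    }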
Detecting invalidly encoded text later doesn't help anything in and of
itself; lifting the requirement that everything be converted to Unicode
early on opens up some options.
How useful are those options? Are they worth the price? Converting to
unicode early permits one to, e.g., write encoding-independent
tokenizers, stemmers, etc. That seems like a lot to throw away.
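As a sketch of what "encoding-independent" buys (plain Java; a hypothetical
whitespace tokenizer, not taken from any particular analyzer): a component
that consumes characters rather than bytes never needs to know which encoding
the source used, because that decision was made once when the Reader was
constructed.

    import java.io.IOException;
    import java.io.Reader;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical tokenizer: it works on characters (a Reader), not bytes,
    // so the same code handles Latin-1, UTF-8, Shift_JIS, or any other
    // source encoding. The encoding is dealt with exactly once, where the
    // Reader is built.
    public class WhitespaceTokenizer {
        static List<String> tokenize(Reader in) throws IOException {
            List<String> tokens = new ArrayList<>();
            StringBuilder current = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
            if (current.length() > 0) {
                tokens.add(current.toString());
            }
            return tokens;
        }
    }

A caller would write something like tokenize(new InputStreamReader(stream,
sourceCharset)); when the source encoding changes, only that one line does.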
UTF-8 has the property that bytewise lexicographic order is the same
as Unicode character order.
Yes. I'm suggesting that an unpatched TermBuffer would have problems with my
index, which contains corrupt character data, because the sort order by
bytestring may not be the same as the sort order by Unicode code point.
I think you're saying that bytewise comparisons involving invalid UTF-8
may differ from comparisons of the unicode code points they represent.
But if they're invalid, they don't actually represent unicode code
points, so how can they be compared?
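To make the distinction concrete, here is a rough sketch (plain Java; the
comparison helpers are hypothetical and only for illustration). For valid
UTF-8, byte order and code point order agree; a byte sequence that is not
valid UTF-8 represents no code points at all, and once it is decoded
leniently to U+FFFD its bytes no longer sort where the raw bytes did.

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class ByteOrderVsCodePointOrder {
        // Lexicographic comparison of byte arrays, treating bytes as unsigned.
        static int compareBytes(byte[] a, byte[] b) {
            int len = Math.min(a.length, b.length);
            for (int i = 0; i < len; i++) {
                int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
                if (diff != 0) return diff;
            }
            return a.length - b.length;
        }

        // Comparison by Unicode code point (not by UTF-16 code unit).
        static int compareCodePoints(String a, String b) {
            int i = 0, j = 0;
            while (i < a.length() && j < b.length()) {
                int ca = a.codePointAt(i), cb = b.codePointAt(j);
                if (ca != cb) return ca - cb;
                i += Character.charCount(ca);
                j += Character.charCount(cb);
            }
            return (a.length() - i) - (b.length() - j);
        }

        public static void main(String[] args) {
            // Valid UTF-8: the two orders agree.
            String s1 = "z";       // U+007A, UTF-8 bytes: 7A
            String s2 = "\u00E9";  // U+00E9, UTF-8 bytes: C3 A9
            byte[] b1 = s1.getBytes(StandardCharsets.UTF_8);
            byte[] b2 = s2.getBytes(StandardCharsets.UTF_8);
            System.out.println(Integer.signum(compareBytes(b1, b2)));       // -1
            System.out.println(Integer.signum(compareCodePoints(s1, s2)));  // -1

            // A lone 0xFF is not valid UTF-8, so it has no code point to
            // compare. Decoding it leniently yields U+FFFD, whose UTF-8
            // bytes (EF BF BD) sort differently than the raw 0xFF did.
            byte[] invalid = { (byte) 0xFF };
            String repaired = new String(invalid, StandardCharsets.UTF_8);  // "\uFFFD"
            byte[] reencoded = repaired.getBytes(StandardCharsets.UTF_8);
            System.out.println(Arrays.toString(reencoded));                 // [-17, -65, -67]
            System.out.println(Integer.signum(compareBytes(invalid, reencoded)));  // 1
        }
    }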
Doug