Marvin Humphrey wrote:
I *think* that whether it was invalidly encoded or not wouldn't impact searching -- it doesn't in KinoSearch. It should only affect display.

I think Java's approach of converting everything to Unicode internally is useful. One must still handle dirty input, but it is easy to write output that conforms to standards. I'd hate to lose that.

Java programs have a good reputation for supporting internationalization, better than programs written in languages that represent strings primarily as byte arrays and rely on library utilities for handling encodings and character sets. Java's choice of 16-bit characters may have been an error, but the general approach of converting all textual data to Unicode internally has led to fewer internationalization issues than are common in other systems.

Detecting invalidly encoded text later doesn't help anything in and of itself; lifting the requirement that everything be converted to Unicode early on opens up some options.

How useful are those options? Are they worth the price? Converting to Unicode early permits one to, e.g., write encoding-independent tokenizers, stemmers, etc. That seems like a lot to throw away.

UTF-8 has the property that bytewise lexicographic order is the same as Unicode character order.
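That property is easy to check directly. Here's a quick sketch (hypothetical demo code, not anything from Lucene; the class and method names are mine) that sorts the same terms by code point and by unsigned UTF-8 bytes and confirms the two orders match for valid text:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demo: for valid text, sorting by Unicode code point and sorting by
// unsigned UTF-8 bytes produce the same order.
public class Utf8OrderDemo {

  // Compare two strings code point by code point.
  static int compareByCodePoint(String a, String b) {
    int i = 0, j = 0;
    while (i < a.length() && j < b.length()) {
      int ca = a.codePointAt(i);
      int cb = b.codePointAt(j);
      if (ca != cb) return Integer.compare(ca, cb);
      i += Character.charCount(ca);
      j += Character.charCount(cb);
    }
    return Integer.compare(a.length() - i, b.length() - j);
  }

  // Compare two strings by the unsigned bytes of their UTF-8 encodings.
  static int compareByUtf8Bytes(String a, String b) {
    byte[] ba = a.getBytes(StandardCharsets.UTF_8);
    byte[] bb = b.getBytes(StandardCharsets.UTF_8);
    int n = Math.min(ba.length, bb.length);
    for (int k = 0; k < n; k++) {
      int d = (ba[k] & 0xFF) - (bb[k] & 0xFF);
      if (d != 0) return d;
    }
    return ba.length - bb.length;
  }

  public static void main(String[] args) {
    // ASCII, Latin-1, a BMP character, and a supplementary character.
    String[] byCodePoint = { "zebra", "\u00E9", "\uFFFD", "\uD83D\uDE00", "a" };
    String[] byUtf8Bytes = byCodePoint.clone();

    Arrays.sort(byCodePoint, Utf8OrderDemo::compareByCodePoint);
    Arrays.sort(byUtf8Bytes, Utf8OrderDemo::compareByUtf8Bytes);

    // Prints "true": the two orderings are identical.
    System.out.println(Arrays.equals(byCodePoint, byUtf8Bytes));
  }
}

Note that plain String.compareTo() would not match: it compares UTF-16 code units, so a supplementary character (stored as surrogates) sorts before U+FFFD even though its code point is larger.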


Yes. I'm suggesting that an unpatched TermBuffer would have problems with my index, which contains corrupt character data, because the sort order by bytestring may not be the same as the sort order by Unicode code point.

I think you're saying that bytewise comparisons involving invalid UTF-8 may differ from comparisons of the Unicode code points they represent. But if they're invalid, they don't actually represent Unicode code points, so how can they be compared?
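
For concreteness, here's a small sketch (hypothetical demo code, nothing from TermBuffer) of the divergence Marvin describes: Java's decoder substitutes U+FFFD for a bad byte, and the resulting strings can then compare in the opposite order from the raw bytes:

import java.nio.charset.StandardCharsets;

// With an invalid byte in the mix, raw-byte order and decoded-string
// order can disagree, because the bad byte is replaced with U+FFFD.
public class CorruptTermOrderDemo {
  public static void main(String[] args) {
    // 0x61 followed by a lone continuation byte 0x80 -- invalid UTF-8.
    byte[] corrupt = { 0x61, (byte) 0x80 };
    // "a" followed by U+00E9, whose valid UTF-8 encoding is 0xC3 0xA9.
    byte[] valid = { 0x61, (byte) 0xC3, (byte) 0xA9 };

    // Bytewise (unsigned): 0x80 < 0xC3, so the corrupt term sorts first.
    System.out.println(Integer.compare(corrupt[1] & 0xFF, valid[1] & 0xFF)); // negative

    // Decoding substitutes U+FFFD for the bad byte, and U+FFFD > U+00E9,
    // so the decoded corrupt term sorts *after* the valid one.
    String corruptStr = new String(corrupt, StandardCharsets.UTF_8); // "a\uFFFD"
    String validStr = new String(valid, StandardCharsets.UTF_8);     // "a\u00E9"
    System.out.println(corruptStr.compareTo(validStr)); // positive
  }
}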

Doug
