Marvin Humphrey wrote:
I *think* that whether it was invalidly encoded or not wouldn't impact searching -- it doesn't in KinoSearch. It should only affect display.

I think Java's approach of converting everything to Unicode internally is useful. One must still handle dirty input, but it is easy to write output that conforms to standards. I'd hate to lose that.

Java programs have a good reputation for supporting internationalization, better than programs written in languages that represent strings primarily as byte arrays and rely on library utilities for handling encodings and character sets. Java's choice of 16-bit characters may have been an error, but the general approach of converting all textual data to Unicode internally has led to fewer internationalization issues than are common in other systems.

Detecting invalidly encoded text later doesn't help anything in and of itself; lifting the requirement that everything be converted to Unicode early on opens up some options.

How useful are those options? Are they worth the price? Converting to Unicode early permits one to, e.g., write encoding-independent tokenizers, stemmers, etc. That seems like a lot to throw away.

UTF-8 has the property that bytewise lexicographic order is the same as Unicode character order.
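That property is easy to check directly. Here's a quick sketch (hypothetical demo code, not anything from Lucene; the class and method names are mine) that sorts the same terms by code point and by unsigned UTF-8 bytes and confirms the two orders match for valid text:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Demo: for valid text, sorting by Unicode code point and sorting by
// unsigned UTF-8 bytes produce the same order.
public class Utf8OrderDemo {

  // Compare two strings code point by code point.
  static int compareByCodePoint(String a, String b) {
    int i = 0, j = 0;
    while (i < a.length() && j < b.length()) {
      int ca = a.codePointAt(i);
      int cb = b.codePointAt(j);
      if (ca != cb) return Integer.compare(ca, cb);
      i += Character.charCount(ca);
      j += Character.charCount(cb);
    }
    return Integer.compare(a.length() - i, b.length() - j);
  }

  // Compare two strings by the unsigned bytes of their UTF-8 encodings.
  static int compareByUtf8Bytes(String a, String b) {
    byte[] ba = a.getBytes(StandardCharsets.UTF_8);
    byte[] bb = b.getBytes(StandardCharsets.UTF_8);
    int n = Math.min(ba.length, bb.length);
    for (int k = 0; k < n; k++) {
      int d = (ba[k] & 0xFF) - (bb[k] & 0xFF);
      if (d != 0) return d;
    }
    return ba.length - bb.length;
  }

  public static void main(String[] args) {
    // ASCII, Latin-1, a BMP character, and a supplementary character.
    String[] byCodePoint = { "zebra", "\u00E9", "\uFFFD", "\uD83D\uDE00", "a" };
    String[] byUtf8Bytes = byCodePoint.clone();

    Arrays.sort(byCodePoint, Utf8OrderDemo::compareByCodePoint);
    Arrays.sort(byUtf8Bytes, Utf8OrderDemo::compareByUtf8Bytes);

    // Prints "true": the two orderings are identical.
    System.out.println(Arrays.equals(byCodePoint, byUtf8Bytes));
  }
}

Note that plain String.compareTo() would not match: it compares UTF-16 code units, so a supplementary character (stored as surrogates) sorts before U+FFFD even though its code point is larger.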


Yes. I'm suggesting that an unpatched TermBuffer would have problems with my index, which contains corrupt character data, because the sort order by bytestring may not be the same as the sort order by Unicode code point.

I think you're saying that bytewise comparisons involving invalid UTF-8 may differ from comparisons of the Unicode code points they represent. But if they're invalid, they don't actually represent Unicode code points, so how can they be compared?
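
For concreteness, here's a small sketch (hypothetical demo code, nothing from TermBuffer) of the divergence Marvin describes: Java's decoder substitutes U+FFFD for a bad byte, and the resulting strings can then compare in the opposite order from the raw bytes:

import java.nio.charset.StandardCharsets;

// With an invalid byte in the mix, raw-byte order and decoded-string
// order can disagree, because the bad byte is replaced with U+FFFD.
public class CorruptTermOrderDemo {
  public static void main(String[] args) {
    // 0x61 followed by a lone continuation byte 0x80 -- invalid UTF-8.
    byte[] corrupt = { 0x61, (byte) 0x80 };
    // "a" followed by U+00E9, whose valid UTF-8 encoding is 0xC3 0xA9.
    byte[] valid = { 0x61, (byte) 0xC3, (byte) 0xA9 };

    // Bytewise (unsigned): 0x80 < 0xC3, so the corrupt term sorts first.
    System.out.println(Integer.compare(corrupt[1] & 0xFF, valid[1] & 0xFF)); // negative

    // Decoding substitutes U+FFFD for the bad byte, and U+FFFD > U+00E9,
    // so the decoded corrupt term sorts *after* the valid one.
    String corruptStr = new String(corrupt, StandardCharsets.UTF_8); // "a\uFFFD"
    String validStr = new String(valid, StandardCharsets.UTF_8);     // "a\u00E9"
    System.out.println(corruptStr.compareTo(validStr)); // positive
  }
}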

Doug
