Chuck Williams wrote:
For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem.  If there is a way to beat this
problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as possible. For example, each term in a Query needs to be converted from String to byte[], but after that all search computation could happen by comparing byte arrays. (Note that lexicographic comparisons of UTF-8 encoded bytes give the same results as lexicographic comparisons of Unicode character strings by code point.) And, when indexing, each Token would need to be converted from String to byte[] just once.
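
To illustrate the ordering claim, here is a minimal sketch, not Lucene code. The class and method names (Utf8Order, utf8, compareBytes, compareCodePoints) are hypothetical, made up for this example. It assumes bytes are compared as unsigned values, which matters in Java because byte is signed. One caveat: String.compareTo compares UTF-16 code units, so for characters outside the Basic Multilingual Plane it can disagree with both code point order and UTF-8 byte order.

```java
import java.nio.charset.StandardCharsets;

class Utf8Order {
    /** Encodes a term once; after this, only bytes need to be compared. */
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Lexicographic comparison of byte arrays, treating each byte as unsigned. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared positions are equal, so the shorter array sorts first.
        return a.length - b.length;
    }

    /** Lexicographic comparison of strings by Unicode code point. */
    static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca - cb;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // Whichever string still has characters left sorts last.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        // The last two terms are U+FF5E and U+1D50A (a surrogate pair in UTF-16).
        String[] terms = {"apple", "apples", "zebra", "\u00e9t\u00e9", "\uFF5E", "\uD835\uDD0A"};
        for (String x : terms) {
            for (String y : terms) {
                int byBytes = Integer.signum(compareBytes(utf8(x), utf8(y)));
                int byCodePoints = Integer.signum(compareCodePoints(x, y));
                if (byBytes != byCodePoints) {
                    throw new IllegalStateException("Order mismatch for " + x + " and " + y);
                }
            }
        }
        System.out.println("UTF-8 byte order matched code point order for all pairs");
    }
}
```

In this sketch a query term would go through utf8() once, and every later comparison against indexed terms would use compareBytes() without decoding anything back to a String.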

The Java API can easily be made back-compatible. The harder part would be making the file format back-compatible.

Doug
