Chuck Williams wrote:
For lazy fields, there would be a substantial benefit to having the count on a String be an encoded byte count rather than a Java char count, but this has the same problem. If there is a way to beat this problem, then I'd start arguing for a byte count.
I think the way to beat it is to keep things as bytes as long as possible. For example, each term in a Query needs to be converted from String to byte[], but after that all search computation could happen by comparing byte arrays. (Note that lexicographic comparisons of UTF-8 encoded bytes give the same results as lexicographic comparisons of Unicode character strings by code point.) And, when indexing, each Token would need to be converted from String to byte[] just once.
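To make the ordering claim concrete, here is a rough sketch in plain Java. It is not Lucene code: the class name Utf8TermCompare and its helper methods are made up for illustration. It encodes each term to UTF-8 once, compares the byte arrays as unsigned bytes, and checks that the result agrees with a code-point-by-code-point comparison of the original strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from Lucene.
public class Utf8TermCompare {

    /** Encode a term once, e.g. when parsing a query or indexing a token. */
    public static byte[] toUtf8(String term) {
        return term.getBytes(StandardCharsets.UTF_8);
    }

    /** Lexicographic comparison that treats each byte as unsigned (0..255). */
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Reference ordering: compare two strings one code point at a time. */
    public static int compareCodePoints(String a, String b) {
        int i = 0;
        int j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return ca - cb;
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // The common prefix is equal; the string with characters left over sorts last.
        return (a.length() - i) - (b.length() - j);
    }

    public static void main(String[] args) {
        // Sample terms, including U+FFFD and the supplementary character U+1F600.
        String[] terms = {
            "apple", "apples", "zebra", "\u00e9t\u00e9", "\u4e2d\u6587",
            "\uFFFD", "\uD83D\uDE00"
        };
        for (String s : terms) {
            for (String t : terms) {
                int byBytes = Integer.signum(compareBytes(toUtf8(s), toUtf8(t)));
                int byCodePoints = Integer.signum(compareCodePoints(s, t));
                if (byBytes != byCodePoints) {
                    throw new AssertionError(s + " vs " + t);
                }
            }
        }
        System.out.println("byte order matches code point order");
    }
}
```

One caveat on the sketch: Java's own String.compareTo compares UTF-16 char values, so for characters outside the Basic Multilingual Plane it can disagree with both orderings above. The last two sample terms are such a pair.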
The Java API can easily be made back-compatible. The harder part would be making the file format back-compatible.
Doug