Hi Doug, I totally agree with what you said. Yeah, I think it is more of a file format issue, less of an API issue. It seems that we just need to add an extra constructor to Term.java to take in a UTF-8 byte array.
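Something roughly like this, maybe (just a sketch to make the idea concrete, not the real Term.java -- the bytes field and the lazy decoding are only placeholders for discussion):

  import java.io.UnsupportedEncodingException;

  // Rough sketch only. Keeps the raw UTF-8 bytes around and decodes to
  // a String lazily, so callers that never need the String form pay no
  // conversion cost.
  public final class Term {
    String field;
    String text;
    byte[] bytes;   // hypothetical: raw UTF-8 form of the term text

    public Term(String fld, String txt) {
      field = fld;
      text = txt;
    }

    // Proposed extra constructor: accept the term text as UTF-8 bytes.
    public Term(String fld, byte[] utf8) {
      field = fld;
      bytes = utf8;
    }

    // Decode only when someone actually asks for the String form.
    public String text() {
      if (text == null && bytes != null) {
        try {
          text = new String(bytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
          throw new RuntimeException(e);  // UTF-8 is always supported
        }
      }
      return text;
    }
  }
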
Lucene 2.0 is going to break backward compatibility anyway, right? So, maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0 list?

Cheers,

Jian Chen

On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather than a Java char
> count, but this has the same problem. If there is a way to beat this
> problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as possible. For example, each term in a Query needs to be converted from String to byte[], but after that all search computation could happen comparing byte arrays. (Note that lexicographic comparisons of UTF-8 encoded bytes give the same results as lexicographic comparisons of Unicode character strings.) And, when indexing, each Token would need to be converted from String to byte[] just once.

The Java API can easily be made back-compatible. The harder part would be making the file format back-compatible.

Doug
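A quick way to sanity-check Doug's parenthetical claim about byte ordering (an illustrative standalone snippet, not code from Lucene): compare two strings by Unicode code point and compare their UTF-8 encodings byte by byte, treating the bytes as unsigned, and the results always have the same sign.

  import java.io.UnsupportedEncodingException;

  // Checks that unsigned lexicographic comparison of UTF-8 bytes orders
  // strings the same way as comparison by Unicode code point. (Note that
  // String.compareTo compares UTF-16 code units, which can differ for
  // supplementary characters, so we compare code points explicitly.)
  public class Utf8OrderCheck {

    // Lexicographic comparison of byte arrays, bytes treated as unsigned.
    static int compareBytes(byte[] a, byte[] b) {
      int len = Math.min(a.length, b.length);
      for (int i = 0; i < len; i++) {
        int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
        if (diff != 0) return diff;
      }
      return a.length - b.length;
    }

    // Lexicographic comparison of strings by Unicode code point.
    static int compareCodePoints(String s, String t) {
      int i = 0, j = 0;
      while (i < s.length() && j < t.length()) {
        int cs = s.codePointAt(i), ct = t.codePointAt(j);
        if (cs != ct) return cs - ct;
        i += Character.charCount(cs);
        j += Character.charCount(ct);
      }
      return (s.length() - i) - (t.length() - j);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
      String[] samples = { "abc", "abd", "résumé", "日本語", "\uD834\uDD1E" /* U+1D11E */ };
      for (String s : samples) {
        for (String t : samples) {
          int byChars = Integer.signum(compareCodePoints(s, t));
          int byBytes = Integer.signum(compareBytes(s.getBytes("UTF-8"), t.getBytes("UTF-8")));
          if (byChars != byBytes) {
            System.out.println("MISMATCH: " + s + " vs " + t);
          }
        }
      }
      System.out.println("done");
    }
  }
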