Hi, Doug,

I totally agree with what you said. Yeah, I think it is more of a file
format issue and less of an API issue. It seems that we just need to add
an extra constructor to Term.java that takes in a UTF-8 byte array.
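Something along these lines, just as a sketch (the new constructor and
field names are mine, not the actual API):

    // Sketch only: Term with an extra constructor that accepts the term
    // text as UTF-8 encoded bytes instead of a String.
    public final class Term {
        private final String field;
        private final String text;

        public Term(String field, String text) {   // existing constructor
            this.field = field;
            this.text = text;
        }

        // New: take the term text as a UTF-8 byte array and decode it.
        public Term(String field, byte[] utf8Text) {
            this(field, new String(utf8Text,
                    java.nio.charset.StandardCharsets.UTF_8));
        }
    }

(Internally the text could of course stay as bytes rather than being
decoded right away, which is what you suggest below.)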

Lucene 2.0 is going to break backward compatibility anyway, right? So
maybe this change to standard UTF-8 could be a hot item on the Lucene 2.0
list?

Cheers,

Jian Chen

On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:

Chuck Williams wrote:
> For lazy fields, there would be a substantial benefit to having the
> count on a String be an encoded byte count rather than a Java char
> count, but this has the same problem.  If there is a way to beat this
> problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as
possible.  For example, each term in a Query needs to be converted from
String to byte[], but after that all search computation could happen
comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
encoded bytes give the same results as lexicographic comparisons of
Unicode character strings.)  And, when indexing, each Token would need
to be converted from String to byte[] just once.
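
As an illustration (a sketch, not Lucene code), such a comparison has to
treat the bytes as unsigned values for the ordering to match a comparison
by Unicode code points:

    // Compare two UTF-8 encoded terms byte by byte.  Bytes are masked to
    // unsigned values; with that, the result matches a lexicographic
    // comparison of the corresponding Unicode code point sequences.
    static int compareUtf8(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) return x - y;
        }
        return a.length - b.length;   // a shorter prefix sorts first
    }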

The Java API can easily be made back-compatible.  The harder part would
be making the file format back-compatible.

Doug
