Re: storing term text internally as byte array and bytecount as prefix, etc.

Doug Cutting Tue, 02 May 2006 09:16:38 -0700

Chuck Williams wrote:

For lazy fields, there would be a substantial benefit to having the
count on a String be an encoded byte count rather than a Java char
count, but this has the same problem.  If there is a way to beat this
problem, then I'd start arguing for a byte count.

I think the way to beat it is to keep things as bytes as long as possible. For example, each term in a Query needs to be converted from String to byte[], but after that all search computation could happen comparing byte arrays. (Note that lexicographic comparisons of UTF-8 encoded bytes give the same results as lexicographic comparisons of Unicode character strings.) And, when indexing, each Token would need to be converted from String to byte[] just once.
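
To make the parenthetical concrete, here is a minimal sketch, not Lucene code: the class and method names (ByteTermSketch, toUtf8, compareBytes, compareCodePoints) are made up for illustration. It assumes the bytes are compared as unsigned values and that "Unicode character strings" means ordering by code point. Java's own String.compareTo orders by UTF-16 code unit, which can differ from code point order once supplementary characters are involved, so the sketch checks against a code point comparison instead.

```java
import java.nio.charset.StandardCharsets;

public class ByteTermSketch {
    // Hypothetical helper: encode a term's text as UTF-8 once, up front.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison of two byte arrays, treating each byte
    // as an unsigned value (0..255). Java bytes are signed, hence the mask.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // the shorter array sorts first
    }

    // Reference ordering: lexicographic comparison by Unicode code point.
    static int compareCodePoints(String a, String b) {
        int[] x = a.codePoints().toArray();
        int[] y = b.codePoints().toArray();
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            if (x[i] != y[i]) {
                return Integer.compare(x[i], y[i]);
            }
        }
        return Integer.compare(x.length, y.length);
    }

    public static void main(String[] args) {
        // Sample terms, including non-ASCII and one supplementary character.
        String[] terms = {"apple", "apples", "\u00e9clair", "zebra", "\u4e2d", "\uff5e", "\ud83d\ude00"};
        boolean agree = true;
        for (String s : terms) {
            for (String t : terms) {
                int byBytes = Integer.signum(compareBytes(toUtf8(s), toUtf8(t)));
                int byCodePoints = Integer.signum(compareCodePoints(s, t));
                if (byBytes != byCodePoints) {
                    agree = false;
                }
            }
        }
        System.out.println(agree ? "orders agree" : "orders differ");
    }
}
```

In this sketch the String-to-byte[] conversion happens once per term, and every later comparison touches only the byte arrays.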

The Java API can easily be made back-compatible. The harder part would be making the file format back-compatible.


Doug
