The benefits of a byte count are substantial, including:

   1. Lazy fields can skip strings without reading them, as they do for
      all other value types (see the sketch below).
   2. The file format could be changed to standard UTF-8 without any
      significant performance cost.
   3. Any other operation that works directly on the index format will
      have an easier time with a representation that is a) easy to scan
      quickly and b) consistent (all value types start with a byte
      count).
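
To make item 1 concrete, here is a minimal sketch of the two skip paths.
It uses plain java.io in place of Lucene's IndexInput, and the VInt
reader is a stand-in of mine, not the actual stored-fields code:

    import java.io.DataInput;
    import java.io.IOException;

    public class LazyStringSkip {

        // VInt-style length prefix: low 7 bits per byte, high bit means continue.
        static int readVInt(DataInput in) throws IOException {
            byte b = in.readByte();
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = in.readByte();
                value |= (b & 0x7F) << shift;
            }
            return value;
        }

        // Byte-count prefix: skip the whole value without decoding it.
        static void skipByByteCount(DataInput in) throws IOException {
            int byteLength = readVInt(in);
            in.skipBytes(byteLength);
        }

        // Char-count prefix: every variable-width character must be decoded
        // far enough to learn its length before the next one can be found.
        static void skipByCharCount(DataInput in) throws IOException {
            int charLength = readVInt(in);
            for (int i = 0; i < charLength; i++) {
                int first = in.readByte() & 0xFF;
                if (first >= 0xE0) {
                    in.skipBytes(2);        // 3-byte (modified) UTF-8 sequence
                } else if (first >= 0xC0) {
                    in.skipBytes(1);        // 2-byte sequence
                }                           // else: single-byte sequence
            }
        }
    }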

Re. 3, Jian is concerned about programs in other languages that
manipulate Lucene index files.  I have such a program in Java and face
the same issue.  My case is a robust and general implementation of
IndexUpdater that copies segments while transforming field values,
updating both stored values and postings (not yet term vectors).  It is
optimized to skip (i.e., copy verbatim) or minimally process unchanged
areas, which typically make up most of the index.  Processing unchanged
stored String values is slowed by the current char count representation
-- it faces precisely the same issue as the lazy fields mechanism.
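
On the copy path the contrast is the same; a rough sketch (again plain
java.io rather than Lucene's IndexInput/IndexOutput, with the length
prefix assumed already read) is:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    public class StoredStringCopy {

        // Byte-count layout: an unchanged String passes through as opaque bytes.
        static void copyWithByteCount(DataInput in, DataOutput out, int byteLength)
                throws IOException {
            byte[] buf = new byte[byteLength];
            in.readFully(buf);
            out.write(buf);
        }

        // Char-count layout: every character must be decoded far enough to find
        // its end, even though none of the bytes actually change.
        static void copyWithCharCount(DataInput in, DataOutput out, int charLength)
                throws IOException {
            for (int i = 0; i < charLength; i++) {
                int first = in.readByte() & 0xFF;
                out.write(first);
                int extra = first >= 0xE0 ? 2 : (first >= 0xC0 ? 1 : 0);
                for (int j = 0; j < extra; j++) {
                    out.write(in.readByte());   // trailing bytes of a multi-byte char
                }
            }
        }
    }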

Re. the file format compatibility issue, if backward compatibility is a
requirement here, then it would seem necessary to have a configuration
option to choose the encoding of stored strings.  It seems easy to
generalize the Lucene APIs to accept an interface for any desired
encode/decode; one possible shape is sketched below.
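
Purely as a hypothetical illustration -- StoredStringCodec and its
methods are names I am making up here, not anything in Lucene -- such an
interface might look like:

    import java.nio.charset.StandardCharsets;

    // Hypothetical hook; not part of any existing Lucene API.
    public interface StoredStringCodec {

        /** Encode a field value; the array's length becomes the stored byte count. */
        byte[] encode(String value);

        /** Decode byteLength bytes starting at offset back into a String. */
        String decode(byte[] bytes, int offset, int byteLength);

        /** Standard UTF-8; a legacy index could plug in a modified-UTF-8 codec. */
        StoredStringCodec UTF8 = new StoredStringCodec() {
            public byte[] encode(String value) {
                return value.getBytes(StandardCharsets.UTF_8);
            }
            public String decode(byte[] bytes, int offset, int byteLength) {
                return new String(bytes, offset, byteLength, StandardCharsets.UTF_8);
            }
        };
    }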

Chuck


jian chen wrote on 05/02/2006 08:15 AM:
> Hi, Doug,
>
> I totally agree with what you said. Yeah, I think it is more of a file
> format issue, less of an API issue. It seems that we just need to add an
> extra constructor to Term.java to take in a UTF-8 byte array.
>
> Lucene 2.0 is going to break backward compatibility anyway, right? So,
> maybe this change to standard UTF-8 could be a hot item on the Lucene
> 2.0 list?
>
> Cheers,
>
> Jian Chen
>
> On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Chuck Williams wrote:
>> > For lazy fields, there would be a substantial benefit to having the
>> > count on a String be an encoded byte count rather than a Java char
>> > count, but this has the same problem.  If there is a way to beat this
>> > problem, then I'd start arguing for a byte count.
>>
>> I think the way to beat it is to keep things as bytes as long as
>> possible.  For example, each term in a Query needs to be converted from
>> String to byte[], but after that all search computation could happen
>> comparing byte arrays.  (Note that lexicographic comparisons of UTF-8
>> encoded bytes give the same results as lexicographic comparisons of
>> Unicode character strings.)  And, when indexing, each Token would need
>> to be converted from String to byte[] just once.
>>
>> The Java API can easily be made back-compatible.  The harder part would
>> be making the file format back-compatible.
>>
>> Doug
>>
>

