The benefits of a byte count are substantial:

1. Lazy fields can skip strings without reading them, as they already do
   for all other value types (see the sketch after this list).
2. The file format could be changed to standard UTF-8 without any
   significant performance cost.
3. Any other index operation that relies on the index format will have an
   easier time with a representation that is a) easy to scan quickly and
   b) consistent (all value types would start with a byte count).
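To make the skipping benefit (1) concrete: with a byte count the value can
be skipped in a single seek, while a char count forces the reader to examine
every variable-width character just to find where the value ends. A minimal
sketch, assuming input is a Lucene IndexInput positioned at a stored
string's length prefix (illustrative code, not the actual FieldsReader):

    // With a byte count: skip the stored value in one seek.
    int byteCount = input.readVInt();
    input.seek(input.getFilePointer() + byteCount);

    // With a char count: the value's byte width is unknown, so every
    // 1-3 byte (modified UTF-8) sequence must be inspected to find the end.
    int charCount = input.readVInt();
    for (int i = 0; i < charCount; i++) {
      byte b = input.readByte();
      if ((b & 0x80) != 0) {        // leading byte of a multi-byte sequence
        input.readByte();           // second byte
        if ((b & 0x20) != 0)
          input.readByte();         // third byte of a 3-byte sequence
      }
    }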
Re. 3, Jian is concerned about programs in other languages that manipulate
Lucene index files. I have such a program in Java and face the same issue.
My case is a robust and general implementation of IndexUpdater that copies
segments while transforming field values, updating both stored values and
postings (not yet term vectors). It is optimized to skip (copy) and/or
minimally process unchanged areas, which are typically most areas. This
processing is slowed for unchanged stored String values by the current char
count representation -- it faces precisely the same issue as the lazy
fields mechanism.

Re. the file format compatibility issue, if backward compatibility is a
requirement here, then it would seem necessary to have a configuration
option to choose the encoding of stored strings. It seems easy to
generalize the Lucene APIs to specify an interface for any desired
encode/decode; a sketch follows the quoted thread below.

Chuck

jian chen wrote on 05/02/2006 08:15 AM:
> Hi, Doug,
>
> I totally agree with what you said. Yeah, I think it is more of a file
> format issue, less of an API issue. It seems that we just need to add an
> extra constructor to Term.java to take in a UTF-8 byte array.
>
> Lucene 2.0 is going to break backward compatibility anyway, right? So,
> maybe this change to standard UTF-8 could be a hot item on the Lucene
> 2.0 list?
>
> Cheers,
>
> Jian Chen
>
> On 5/2/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Chuck Williams wrote:
>> > For lazy fields, there would be a substantial benefit to having the
>> > count on a String be an encoded byte count rather than a Java char
>> > count, but this has the same problem. If there is a way to beat this
>> > problem, then I'd start arguing for a byte count.
>>
>> I think the way to beat it is to keep things as bytes as long as
>> possible. For example, each term in a Query needs to be converted from
>> String to byte[], but after that all search computation could happen
>> comparing byte arrays. (Note that lexicographic comparisons of UTF-8
>> encoded bytes give the same results as lexicographic comparisons of
>> Unicode character strings.) And, when indexing, each Token would need
>> to be converted from String to byte[] just once.
>>
>> The Java API can easily be made back-compatible. The harder part would
>> be making the file format back-compatible.
>>
>> Doug
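As promised above, a sketch of the kind of encode/decode interface I mean.
All names here are hypothetical, not an existing Lucene API; the essential
point is that the writer records the encoded byte length, so readers and
index-manipulating tools can skip values without decoding them:

    // Hypothetical interface -- illustrative only, not a Lucene API.
    public interface StoredStringCodec {
      // Encode a field value; the returned array's length is what the
      // writer would store as the value's byte count.
      byte[] encode(String value);

      // Decode bytes previously produced by encode().
      String decode(byte[] bytes, int offset, int length);
    }

    // A standard UTF-8 implementation of the hypothetical interface.
    public class Utf8Codec implements StoredStringCodec {
      public byte[] encode(String value) {
        try {
          return value.getBytes("UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
          throw new RuntimeException(e);  // UTF-8 is always supported
        }
      }
      public String decode(byte[] bytes, int offset, int length) {
        try {
          return new String(bytes, offset, length, "UTF-8");
        } catch (java.io.UnsupportedEncodingException e) {
          throw new RuntimeException(e);
        }
      }
    }

The configuration option I mention above would then just select the codec,
with the current encoding available as the back-compatible choice.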