Hi all,

Recently I have been following the discussion on storing text/strings as standard UTF-8 and how to achieve that in Lucene.
I now understand that storing the term text and the field strings as UTF-8 bytes is a tricky issue because of the performance cost of converting back and forth between UTF-8 bytes and a Java String. This seems especially problematic for the segment merger routine, which loads the segment term enums and converts the UTF-8 bytes back to Strings during the merge operation.

Just a thought: could we always represent the term text as UTF-8 bytes internally? Term.java would then have the private member variable

    private byte[] utf8bytes;

instead of

    private String text;

In addition, a Term object could be constructed either from a String or from a UTF-8 byte array. When indexing new documents, new Term(String text) would be called and utf8bytes would be derived from the input term text. For a segment term info merge, utf8bytes would be loaded directly from the Lucene index, which already stores the term text as UTF-8 bytes, so no conversion would be needed.

I hope I explained my thoughts. Make sense?
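To make the idea concrete, here is a rough sketch of what Term.java could look like. The names (utf8bytes, utf8Bytes()) and the lazy decoding in text() are just my illustration, not actual Lucene code:

    import java.io.UnsupportedEncodingException;

    public final class Term {
        private final String field;
        private final byte[] utf8bytes; // canonical internal representation
        private String text;            // decoded lazily, only if asked for

        // Indexing path: the term arrives as a String, so encode it once.
        public Term(String field, String text) {
            this.field = field;
            this.text = text;
            try {
                this.utf8bytes = text.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e); // UTF-8 is always supported
            }
        }

        // Merge path: bytes come straight from the index, no decode needed.
        public Term(String field, byte[] utf8bytes) {
            this.field = field;
            this.utf8bytes = utf8bytes;
        }

        public String field() { return field; }

        public byte[] utf8Bytes() { return utf8bytes; }

        // Pay the conversion cost only when a caller actually needs a String.
        public String text() {
            if (text == null) {
                try {
                    text = new String(utf8bytes, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new RuntimeException(e);
                }
            }
            return text;
        }
    }

With something like this, the merge code would never touch a String at all unless some caller explicitly asks for the text.

Cheers,
Jian Chen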