Hi all,

Recently I have been following the discussion on storing text/strings as standard UTF-8 and how to achieve that in Lucene.
I now understand that storing the term text and the field strings as UTF-8 bytes is a tricky issue because of the performance cost of converting back and forth between UTF-8 bytes and a Java String. This seems especially problematic for the segment merger routine, which loads the segment term enums and converts the UTF-8 bytes back to Strings during the merge operation.

Just a thought: could we always represent the term text as UTF-8 bytes internally? Term.java would then have the private member variable

    private byte[] utf8bytes;

instead of

    private String text;

In addition, a Term object could be constructed either from a String or from a UTF-8 byte array. When indexing new documents, new Term(String text) would be called and utf8bytes would be derived from the input term text. For a segment term info merge, utf8bytes would be loaded directly from the Lucene index, which already stores the term text as UTF-8 bytes, so no conversion would be needed.

I hope I explained my thoughts. Make sense?
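To make the idea concrete, here is a rough sketch of what Term.java could look like. The names (utf8bytes, utf8Bytes()) and the lazy decoding in text() are just my illustration, not actual Lucene code:

    import java.io.UnsupportedEncodingException;

    public final class Term {
        private final String field;
        private final byte[] utf8bytes; // canonical internal representation
        private String text;            // decoded lazily, only if asked for

        // Indexing path: the term arrives as a String, so encode it once.
        public Term(String field, String text) {
            this.field = field;
            this.text = text;
            try {
                this.utf8bytes = text.getBytes("UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new RuntimeException(e); // UTF-8 is always supported
            }
        }

        // Merge path: bytes come straight from the index, no decode needed.
        public Term(String field, byte[] utf8bytes) {
            this.field = field;
            this.utf8bytes = utf8bytes;
        }

        public String field() { return field; }

        public byte[] utf8Bytes() { return utf8bytes; }

        // Pay the conversion cost only when a caller actually needs a String.
        public String text() {
            if (text == null) {
                try {
                    text = new String(utf8bytes, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new RuntimeException(e);
                }
            }
            return text;
        }
    }

With something like this, the merge code would never touch a String at all unless some caller explicitly asks for the text.

Cheers,
Jian Chen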