Erik Hatcher wrote...
What, if any, performance impact would changing Java Lucene in this regard have?
And Ken Krugler wrote...
"Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data."
I had been working under the assumption that the value of the VInt would change as well. It seemed logical that if strings were encoded as legal UTF-8, the count at the head should indicate either 1) the number of Unicode code points in the string, or 2) the number of bytes occupied by the encoded string.
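To make the difference concrete, here's a small demo (my own example, nothing from Lucene) showing how the three candidate counts diverge once a string contains a character outside the Basic Multilingual Plane:

    import java.nio.charset.StandardCharsets;

    // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), which needs a
    // surrogate pair in UTF-16 and four bytes in UTF-8.
    class CountDemo {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1E";

            int utf16Units = s.length();                      // 3: what the VInt holds today
            int codePoints = s.codePointCount(0, s.length()); // 2: option 1
            int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length; // 5: option 2

            System.out.println(utf16Units + " " + codePoints + " " + utf8Bytes);
        }
    }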
Either of those would require more substantial changes to Java Lucene. I expect that the performance impact of the first option could be made negligible, but the backwards-compatibility story would get a lot messier.
It simply had not occurred to me to keep the VInt as is. If you do that, this becomes a much more localized problem.
For Plucene, I'll avoid the gory details and just say that having the VInt continue to represent UTF-16 code units limits the availability of certain options, but doesn't cause major inefficiencies. Now that we know that's what it does, we can work with it. A transition to always-legal UTF-8 obviates the need to scan for and fix the edge cases, and addresses my main concern.
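For what it's worth, here's a sketch of why the UTF-16-unit count remains workable over always-legal UTF-8: a reader can still find the end of the string by charging each four-byte sequence (a supplementary character, i.e. a surrogate pair in Java) as two code units and everything else as one. This is my own illustration, not code from either project:

    // Given a buffer of legal UTF-8 and a length in UTF-16 code units,
    // return how many bytes those units occupy starting at `start`.
    // Assumes `start` sits on a lead byte; continuation bytes (0x80-0xBF)
    // are never examined on legal input.
    class Utf8Scanner {
        static int bytesForUnits(byte[] buf, int start, int units) {
            int pos = start;
            while (units > 0) {
                int b = buf[pos] & 0xFF;
                if      (b < 0x80) { pos += 1; units -= 1; } // ASCII
                else if (b < 0xE0) { pos += 2; units -= 1; } // 2-byte sequence
                else if (b < 0xF0) { pos += 3; units -= 1; } // 3-byte sequence
                else               { pos += 4; units -= 2; } // 4-byte: surrogate pair
            }
            return pos - start;
        }
    }

That per-sequence bookkeeping is the only extra cost relative to a plain byte count, which is why I'd call the inefficiency minor rather than major.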
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/