Erik Hatcher wrote...
What, if any, performance impact would changing Java Lucene in this regard have?
And Ken Krugler wrote...
"Lucene writes strings as a VInt representing the length of the string in Java chars (UTF-16 code units), followed by the character data."
I had been working under the assumption that the value of the VInt would change as well. It seemed logical that if strings were encoded as legal UTF-8, the count at the head should indicate either 1) the number of Unicode code points in the string, or 2) the number of bytes occupied by the encoded string.
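To make the difference concrete, here's a small demo (my own example, nothing from Lucene) showing how the three candidate counts diverge once a string contains a character outside the Basic Multilingual Plane:

    import java.nio.charset.StandardCharsets;

    // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), which needs a
    // surrogate pair in UTF-16 and four bytes in UTF-8.
    class CountDemo {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1E";

            int utf16Units = s.length();                      // 3: what the VInt holds today
            int codePoints = s.codePointCount(0, s.length()); // 2: option 1
            int utf8Bytes  = s.getBytes(StandardCharsets.UTF_8).length; // 5: option 2

            System.out.println(utf16Units + " " + codePoints + " " + utf8Bytes);
        }
    }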
Either of those would require more substantial changes to Java Lucene. I expect that the performance impact of the first option could be made negligible, but the backwards-compatibility story would get a lot messier.
It simply had not occurred to me to keep the VInt as is. If you do that, this becomes a much more localized problem.
For Plucene, I'll avoid the gory details and just say that having the VInt continue to represent UTF-16 code units limits the availability of certain options, but doesn't cause major inefficiencies. Now that we know that's what it does, we can work with it. A transition to always-legal UTF-8 obviates the need to scan for and fix the edge cases, and addresses my main concern.
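For what it's worth, here's a sketch of why the UTF-16-unit count remains workable over always-legal UTF-8: a reader can still find the end of the string by charging each four-byte sequence (a supplementary character, i.e. a surrogate pair in Java) as two code units and everything else as one. This is my own illustration, not code from either project:

    // Given a buffer of legal UTF-8 and a length in UTF-16 code units,
    // return how many bytes those units occupy starting at `start`.
    // Assumes `start` sits on a lead byte; continuation bytes (0x80-0xBF)
    // are never examined on legal input.
    class Utf8Scanner {
        static int bytesForUnits(byte[] buf, int start, int units) {
            int pos = start;
            while (units > 0) {
                int b = buf[pos] & 0xFF;
                if      (b < 0x80) { pos += 1; units -= 1; } // ASCII
                else if (b < 0xE0) { pos += 2; units -= 1; } // 2-byte sequence
                else if (b < 0xF0) { pos += 3; units -= 1; } // 3-byte sequence
                else               { pos += 4; units -= 2; } // 4-byte: surrogate pair
            }
            return pos - start;
        }
    }

That per-sequence bookkeeping is the only extra cost relative to a plain byte count, which is why I'd call the inefficiency minor rather than major.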
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/