Re: Lucene and UTF-8

Marvin Humphrey Tue, 20 Sep 2005 19:27:01 -0700

Hello again,

I've prepared a patch for IndexInput.java, and an accompanying patch for TestIndexInput.java. I figured I would submit them for discussion here before filing them via Jira. The patches are attached to this email; if I find that they get stripped by the listserv, I'll post them on a website.

The patch to IndexInput.java makes it capable of decoding both modified UTF-8 and valid UTF-8, so backwards compatibility is preserved. I'll have another patch for IndexOutput.java soon, but IndexInput.java doesn't have to wait for it.

A crude benchmarking app I already have set up (it just builds an index with 1000 docs) seems to support my expectation: this change to IndexInput should have little or no impact on speed with western, mostly-ascii text. It might actually be a smidgen faster with text which is mostly multi-byte UTF-8, since an if-else-if chain with calculations within conditionals has been replaced by a switch based on a lookup table. The only real cost for this patch is the memory hit for loading the 248-byte lookup table.
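To make the lookup-table idea concrete, here is a minimal sketch of a decoder that switches on a sequence length looked up from the lead byte. It is not the attached patch: the class name `Utf8DecodeSketch`, the method `decode`, and the 256-entry table layout are all assumptions made for illustration. It accepts both standard four-byte sequences and the modified UTF-8 forms (surrogates as two three-byte sequences, null as C0 80), and it does no validation of continuation bytes.

```java
class Utf8DecodeSketch {
    // Hypothetical table: sequence length for each possible lead byte.
    // 0 marks continuation bytes and lead bytes that never start a sequence.
    private static final byte[] SEQ_LEN = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) SEQ_LEN[b] = 1;
        for (int b = 0xC0; b <= 0xDF; b++) SEQ_LEN[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) SEQ_LEN[b] = 3;
        for (int b = 0xF0; b <= 0xF7; b++) SEQ_LEN[b] = 4;
    }

    // Decodes bytes that may mix standard and modified UTF-8 sequences.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            switch (SEQ_LEN[b]) {
                case 1: // plain ASCII, including a raw 0x00
                    out.append((char) b);
                    i += 1;
                    break;
                case 2: // also covers the modified UTF-8 null, C0 80
                    out.append((char) (((b & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                    i += 2;
                    break;
                case 3: // BMP char, or one half of a modified UTF-8 surrogate pair
                    out.append((char) (((b & 0x0F) << 12)
                            | ((in[i + 1] & 0x3F) << 6)
                            | (in[i + 2] & 0x3F)));
                    i += 3;
                    break;
                case 4: { // standard UTF-8 for a non-BMP code point
                    int cp = ((b & 0x07) << 18)
                            | ((in[i + 1] & 0x3F) << 12)
                            | ((in[i + 2] & 0x3F) << 6)
                            | (in[i + 3] & 0x3F);
                    out.appendCodePoint(cp); // emits a surrogate pair
                    i += 4;
                    break;
                }
                default:
                    throw new IllegalArgumentException("bad lead byte at offset " + i);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] gClef = {(byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E};
        System.out.println(Integer.toHexString(decode(gClef).codePointAt(0)));
    }
}
```

The single table lookup replaces the chain of mask-and-compare tests, which is the design point being described; the table size and contents in the real patch may differ.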

My local copy of trunk revision 590297 passes all tests with these patches, except for TestIndexModifier which fails regardless.


Ken Krugler wrote...

Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, followed by regular data.
d. Two surrogate pairs in a row.

I've selected U+1D11E "MUSICAL SYMBOL G CLEF" and U+1D160 "MUSICAL SYMBOL EIGHTH NOTE" as the non-BMP code points of choice.


http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm
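For reference, the mapping from those two code points to Java chars can be checked with a few lines. This is only an illustrative sketch; the class name `SurrogateSketch` and the helper `hexUnits` are made up here and are not part of the test patch.

```java
class SurrogateSketch {
    // Returns the UTF-16 code units for a code point as space-separated hex.
    static String hexUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char c : Character.toChars(codePoint)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexUnits(0x1D11E)); // D834 DD1E
        System.out.println(hexUnits(0x1D160)); // D834 DD60
    }
}
```

Both code points share the high surrogate D834 and differ only in the low surrogate.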

It might be my quadranoia acting up again, but it seemed like a good idea to add another test case, since UTF-8 is a stateful encoding (within a short span):


e. A string with two embedded surrogate pairs.

"Lu\uD834\uDD1Ece\uD834\uDD60ne"

Then all of the above, but remove the second (low-order) surrogate character (busted format).
Then all of the above, but replace the first (high-order) surrogate character.

These are interesting. Lucene isn't equipped for detection/correction of invalid Unicode when reading its own index files, and implementing such capabilities would impose a performance penalty. The assumption is that Lucene will always read its own files and that those files will never contain corrupt data. Debatable, but it doesn't seem to have caused problems up till now.

Since there's no way to check if IndexInput catches invalid input, I've skipped these two cases -- but I'll put them in my upcoming IndexOutput patches, which is I think what you intended anyway.

Then all of the above, but replace the surrogate pair with an xC0x80 encoded null byte.


Done.
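To show how the two encodings differ on these inputs, here is a small sketch that encodes a string both ways. It assumes `java.io.DataOutputStream.writeUTF` as a stand-in producer of modified UTF-8 (it is not Lucene's writer, and its two-byte length prefix is stripped below); the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class EncodingCompareSketch {
    // Standard UTF-8: a surrogate pair becomes one four-byte sequence.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 via writeUTF: each surrogate becomes its own
    // three-byte sequence, and U+0000 becomes C0 80.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            // Drop writeUTF's two-byte length prefix.
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "Lu\uD834\uDD1Ece\uD834\uDD60ne";
        System.out.println(standardUtf8(s).length); // 14
        System.out.println(modifiedUtf8(s).length); // 18
    }
}
```

For the two-pair test string, the six ASCII bytes are identical in both forms; the difference is four bytes per pair in standard UTF-8 versus six in the modified form, which is why a reader that wants backwards compatibility has to accept both.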

Three more test batches seemed appropriate.

Cases for the \x00 null, which would previously have been interpreted incorrectly as the start of a 3-byte UTF-8 sequence.


Cases for two-byte UTF-8, using U+00BF "INVERTED QUESTION MARK".
http://www.fileformat.info/info/unicode/char/00bf/index.htm

Cases for three-byte UTF-8, using U+2620 "SKULL AND CROSSBONES".
http://www.fileformat.info/info/unicode/char/2620/index.htm
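The byte sequences these cases exercise can be dumped with a short helper. This is an illustrative sketch only; `Utf8HexSketch` and `utf8Hex` are names invented here, and the output is standard UTF-8 as produced by the JDK, not bytes read from a Lucene index.

```java
import java.nio.charset.StandardCharsets;

class Utf8HexSketch {
    // Returns the standard UTF-8 bytes of a string as space-separated hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u0000")); // 00
        System.out.println(utf8Hex("\u00BF")); // C2 BF
        System.out.println(utf8Hex("\u2620")); // E2 98 A0
    }
}
```

Together with a four-byte non-BMP character, these cover every sequence length a decoder has to handle.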

Previously, there was only a test for the string "Lucene".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
