Hello again,

I've prepared a patch for IndexInput.java, and an accompanying patch for TestIndexInput.java. I figured I would submit them for discussion here before filing them via Jira. The patches are attached to this email; if I find that they get stripped by the listserv, I'll post them on a website.

The patch to IndexInput.java makes it capable of decoding both modified UTF-8 and valid UTF-8, so backwards compatibility is preserved. I'll have another patch for IndexOutput.java soon, but IndexInput.java doesn't have to wait for it.

A crude benchmarking app I already have set up (it just builds an index with 1000 docs) seems to support my expectation: this change to IndexInput should have little or no impact on speed with Western, mostly-ASCII text. It might actually be a smidgen faster with text that is mostly multi-byte UTF-8, since an if-else-if chain with calculations inside the conditionals has been replaced by a switch driven by a lookup table. The only real cost of this patch is the memory hit for loading the 248-byte lookup table.
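To make the lookup-table idea concrete, here is a minimal sketch of a decoder that switches on a per-lead-byte sequence length. It is not the code from the attached patch: the class name `Utf8DecodeSketch`, the `decode` method, and the 256-entry table layout are all assumptions made for illustration (the patch's table is 248 bytes, and its layout isn't shown here). Like the approach described in this message, it does no validation of continuation bytes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration only; not the code from the IndexInput patch. */
final class Utf8DecodeSketch {
    // Sequence length keyed by lead byte; 0 marks a byte that cannot start a sequence.
    // Assumption: a full 256-entry table, chosen for simplicity.
    private static final byte[] SEQ_LEN = new byte[256];
    static {
        for (int b = 0x00; b <= 0x7F; b++) SEQ_LEN[b] = 1; // ASCII
        for (int b = 0xC0; b <= 0xDF; b++) SEQ_LEN[b] = 2; // includes 0xC0 0x80, the modified UTF-8 null
        for (int b = 0xE0; b <= 0xEF; b++) SEQ_LEN[b] = 3; // includes separately encoded surrogate halves
        for (int b = 0xF0; b <= 0xF7; b++) SEQ_LEN[b] = 4; // standard UTF-8 beyond the BMP
    }

    /** Decodes standard or modified UTF-8 without validating continuation bytes. */
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            switch (SEQ_LEN[b]) {
                case 1:
                    out.append((char) b);
                    i += 1;
                    break;
                case 2:
                    out.append((char) (((b & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                    i += 2;
                    break;
                case 3:
                    // A surrogate half written as its own 3-byte sequence lands here
                    // and is appended as a single Java char.
                    out.append((char) (((b & 0x0F) << 12)
                            | ((in[i + 1] & 0x3F) << 6)
                            | (in[i + 2] & 0x3F)));
                    i += 3;
                    break;
                case 4: {
                    int cp = ((b & 0x07) << 18)
                            | ((in[i + 1] & 0x3F) << 12)
                            | ((in[i + 2] & 0x3F) << 6)
                            | (in[i + 3] & 0x3F);
                    out.appendCodePoint(cp); // emits a surrogate pair
                    i += 4;
                    break;
                }
                default:
                    throw new IllegalArgumentException("invalid lead byte at offset " + i);
            }
        }
        return out.toString();
    }
}
```

Because a 3-byte sequence is simply appended as one char, the same loop accepts both encodings of a non-BMP character: four bytes in standard UTF-8, or two 3-byte surrogate halves in modified UTF-8.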

My local copy of trunk revision 590297 passes all tests with these patches, except for TestIndexModifier which fails regardless.

Ken Krugler wrote...

Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

I've selected U+1D11E "MUSICAL SYMBOL G CLEF" and U+1D160 "MUSICAL SYMBOL EIGHTH NOTE" as the non-BMP code points of choice.

http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm

It might be my quadranoia acting up again, but it seemed like a good idea to add another test case, since UTF-8 is a stateful encoding (within a short span):

e. A string with two embedded surrogate pairs.

"Lu\uD834\uDD1Ece\uD834\uDD60ne"
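For anyone who wants to inspect the bytes these test strings produce, the following sketch encodes them both ways. It assumes that the modified UTF-8 form in question matches what `java.io.DataOutputStream.writeUTF` emits (minus its 2-byte length prefix); that is an assumption about the index format, not something taken from the patch. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for inspecting the test strings; not part of the patch. */
final class SurrogateEncodingSketch {
    static final String G_CLEF = "\uD834\uDD1E";      // U+1D11E
    static final String EIGHTH_NOTE = "\uD834\uDD60"; // U+1D160
    static final String EMBEDDED = "Lu" + G_CLEF + "ce" + EIGHTH_NOTE + "ne";

    /** Standard UTF-8: one 4-byte sequence per non-BMP code point. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Modified UTF-8 as produced by writeUTF, with the 2-byte length prefix dropped. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

With these helpers, `hex(standardUtf8(G_CLEF))` gives `F0 9D 84 9E`, while `hex(modifiedUtf8(G_CLEF))` gives `ED A0 B4 ED B4 9E`: the same character, encoded as one 4-byte sequence versus two 3-byte surrogate halves.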

Then all of the above, but remove the second (low-order) surrogate character (busted format).

Then all of the above, but replace the first (high-order) surrogate character.

These are interesting. Lucene isn't equipped for detection/correction of invalid Unicode when reading its own index files, and implementing such capabilities would impose a performance penalty. The assumption is that Lucene will always read its own files and that those files will never contain corrupt data. Debatable, but it doesn't seem to have caused problems up till now.

Since there's no way to check if IndexInput catches invalid input, I've skipped these two cases -- but I'll put them in my upcoming IndexOutput patches, which I think is what you intended anyway.

Then all of the above, but replace the surrogate pair with a 0xC0 0x80 encoded null byte.

Done.

Three more test batches seemed appropriate.

Cases for the \x00 null, which would previously have been interpreted incorrectly as the start of a 3-byte UTF-8 sequence.

Cases for two-byte UTF-8, using U+00BF "INVERTED QUESTION MARK".
http://www.fileformat.info/info/unicode/char/00bf/index.htm

Cases for three-byte UTF-8, using U+2620 "SKULL AND CROSSBONES".
http://www.fileformat.info/info/unicode/char/2620/index.htm
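As a worked illustration of the byte layouts behind the two-byte and three-byte cases, here is a small sketch of a standard UTF-8 encoder for a single code point. The class name `Utf8EncodeSketch` and its `encode` method are hypothetical and are not taken from the test patch; the sketch does not reject surrogates or out-of-range values.

```java
/** Hypothetical illustration of the standard UTF-8 bit layout; not part of the patch. */
final class Utf8EncodeSketch {
    /** Encodes one code point as standard UTF-8, without validation. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx
            return new byte[] {(byte) cp};
        }
        if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))
            };
        }
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return new byte[] {
            (byte) (0xF0 | (cp >> 18)),
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F))
        };
    }
}
```

Under this layout, U+00BF encodes to C2 BF and U+2620 encodes to E2 98 A0, so each of the two characters exercises a different sequence length in the decoder.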

Previously, there was only a test for the string "Lucene".

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


