Hello again,
I've prepared a patch for IndexInput.java, and an accompanying patch
for TestIndexInput.java. I figured I would submit them for
discussion here before filing them via Jira. The patches are
attached to this email; if I find that they get stripped by the
listserv, I'll post them on a website.
The patch to IndexInput.java makes it capable of decoding both
modified UTF-8 and valid UTF-8, so backwards compatibility is
preserved. I'll have another patch for IndexOutput.java soon, but
IndexInput.java doesn't have to wait for it.
A crude benchmarking app I already have set up (it just builds an
index with 1000 docs) seems to support my expectation: this change to
IndexInput should have little or no impact on speed with Western,
mostly-ASCII text. It might actually be a smidgen faster with text
that is mostly multi-byte UTF-8, since an if-else-if chain with
calculations within conditionals has been replaced by a switch based
on a lookup table. The only real cost of this patch is the memory
hit for loading the 248-byte lookup table.
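To make the lookup-table-plus-switch idea concrete, here is a minimal sketch. It is not the code from the attached patch: the class name Utf8DecodeSketch, the table SEQ_LEN, and the decode method are all hypothetical, and a table covering lead bytes 0x00 through 0xF7 (248 entries) is only my assumption about one layout that would match the size quoted above. It does no validation of continuation bytes or truncated input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; NOT the code from the attached patch.
final class Utf8DecodeSketch {
    // Sequence length indexed by lead byte, 0x00..0xF7 (248 entries, an assumption).
    // 0 marks bytes that cannot start a sequence (continuation bytes 0x80..0xBF).
    private static final byte[] SEQ_LEN = new byte[248];
    static {
        for (int b = 0x00; b <= 0x7F; b++) SEQ_LEN[b] = 1;
        for (int b = 0xC0; b <= 0xDF; b++) SEQ_LEN[b] = 2;
        for (int b = 0xE0; b <= 0xEF; b++) SEQ_LEN[b] = 3;
        for (int b = 0xF0; b <= 0xF7; b++) SEQ_LEN[b] = 4;
    }

    // Decodes standard UTF-8 and modified UTF-8 (0xC0 0x80 nulls, and
    // surrogates written as two 3-byte sequences) with the same switch.
    static String decode(byte[] in) {
        StringBuilder out = new StringBuilder(in.length);
        int i = 0;
        while (i < in.length) {
            int b = in[i] & 0xFF;
            int len = b < SEQ_LEN.length ? SEQ_LEN[b] : 0;
            switch (len) {
                case 1:
                    out.append((char) b);
                    break;
                case 2:
                    out.append((char) (((b & 0x1F) << 6) | (in[i + 1] & 0x3F)));
                    break;
                case 3:
                    // A 3-byte sequence may hold half of a surrogate pair
                    // (modified UTF-8); appending it as a char keeps the pair intact.
                    out.append((char) (((b & 0x0F) << 12)
                            | ((in[i + 1] & 0x3F) << 6)
                            | (in[i + 2] & 0x3F)));
                    break;
                case 4: {
                    // Standard UTF-8 for a non-BMP code point.
                    int cp = ((b & 0x07) << 18)
                            | ((in[i + 1] & 0x3F) << 12)
                            | ((in[i + 2] & 0x3F) << 6)
                            | (in[i + 3] & 0x3F);
                    out.appendCodePoint(cp);
                    break;
                }
                default:
                    throw new IllegalArgumentException("unexpected lead byte at offset " + i);
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "Lu\uD834\uDD1Ece";
        byte[] standard = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(standard).equals(s));
    }
}
```

Both encodings produce the same Java chars in this sketch, which is why a single decoder can stay backwards compatible with older index files.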
My local copy of trunk revision 590297 passes all tests with these
patches, except for TestIndexModifier, which fails regardless.
Ken Krugler wrote...
> Good test data for the decoder would be the following:
> a. Single surrogate pair (two Java chars)
> b. Surrogate pair at the beginning, followed by regular data.
> c. Surrogate pair at the end, preceded by regular data.
> d. Two surrogate pairs in a row.
I've selected U+1D11E "MUSICAL SYMBOL G CLEF" and U+1D160 "MUSICAL
SYMBOL EIGHTH NOTE" as the non-BMP code points of choice.
http://www.fileformat.info/info/unicode/char/01d11e/index.htm
http://www.fileformat.info/info/unicode/char/01d160/index.htm
It might be my quadranoia acting up again, but it seemed like a good
idea to add another test case, since UTF-8 is a stateful encoding
(within a short span):
e. A string with two embedded surrogate pairs.
"Lu\uD834\uDD1Ece\uD834\uDD60ne"
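For anyone who wants to reproduce the test strings, here is a rough sketch of how cases a through e could be built from the two code points. The class name SurrogateTestData and the use of "Lucene" as the "regular data" are my assumptions for illustration; the actual strings in the TestIndexInput.java patch may differ.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the strings in the real test patch may differ.
final class SurrogateTestData {
    // U+1D11E MUSICAL SYMBOL G CLEF, i.e. the surrogate pair \uD834\uDD1E
    static final String G_CLEF = new String(Character.toChars(0x1D11E));
    // U+1D160 MUSICAL SYMBOL EIGHTH NOTE, i.e. the surrogate pair \uD834\uDD60
    static final String EIGHTH_NOTE = new String(Character.toChars(0x1D160));

    static String[] cases() {
        return new String[] {
            G_CLEF,                                    // a. single surrogate pair
            G_CLEF + "Lucene",                         // b. pair at the beginning
            "Lucene" + G_CLEF,                         // c. pair at the end
            G_CLEF + EIGHTH_NOTE,                      // d. two pairs in a row
            "Lu" + G_CLEF + "ce" + EIGHTH_NOTE + "ne"  // e. two embedded pairs
        };
    }

    public static void main(String[] args) {
        for (String s : cases()) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(s.length() + " chars, "
                    + s.codePointCount(0, s.length()) + " code points, "
                    + utf8.length + " UTF-8 bytes");
        }
    }
}
```

Each non-BMP code point counts as two Java chars but one code point, and takes four bytes in standard UTF-8.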
> Then all of the above, but remove the second (low-order) surrogate
> character (busted format).
> Then all of the above, but replace the first (high-order) surrogate
> character.
These are interesting. Lucene isn't equipped for detection/
correction of invalid Unicode when reading its own index files, and
implementing such capabilities would impose a performance penalty.
The assumption is that Lucene will always read its own files and that
those files will never contain corrupt data. Debatable, but it
doesn't seem to have caused problems up till now.
Since there's no way to check whether IndexInput catches invalid input,
I've skipped these two cases -- but I'll put them in my upcoming
IndexOutput patches, which I think is what you intended anyway.
> Then all of the above, but replace the surrogate pair with an xC0
> x80 encoded null byte.
Done.
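In case the difference between the two encodings is not fresh in everyone's mind, the sketch below prints the bytes each one produces for a null and for the G clef. It uses the JDK's DataOutputStream.writeUTF as a stand-in source of modified UTF-8, not Lucene's IndexOutput; the class name ModifiedUtf8Demo and its helper methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo using the JDK's modified UTF-8 writer, not Lucene code.
final class ModifiedUtf8Demo {
    // Returns the modified UTF-8 bytes for s, without writeUTF's length prefix.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF emits a 2-byte length before the encoded characters.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Formats bytes as space-separated uppercase hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String clef = "\uD834\uDD1E"; // U+1D11E
        System.out.println(hex(modifiedUtf8(nul)));                       // C0 80
        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));    // 00
        System.out.println(hex(modifiedUtf8(clef)));                      // ED A0 B4 ED B4 9E
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_8)));   // F0 9D 84 9E
    }
}
```

A decoder that accepts both forms has to treat C0 80 and 00 as the same character, and likewise the six-byte and four-byte forms of a non-BMP code point.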
Three more test batches seemed appropriate.
Cases for the \x00 null, which would previously have been interpreted
incorrectly as the start of a 3-byte UTF-8 sequence.
Cases for two-byte UTF-8, using U+00BF "INVERTED QUESTION MARK".
http://www.fileformat.info/info/unicode/char/00bf/index.htm
Cases for three-byte UTF-8, using U+2620 "SKULL AND CROSSBONES".
http://www.fileformat.info/info/unicode/char/2620/index.htm
Previously, there was only a test for the string "Lucene".
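As a quick sanity check on why these particular code points cover the two-byte and three-byte branches, the following sketch reports the standard UTF-8 length of a code point using the JDK encoder. The class name Utf8Lengths and the method utf8Length are hypothetical and are not part of either patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of either patch.
final class Utf8Lengths {
    // Number of bytes the JDK's standard UTF-8 encoder emits for one code point.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x0000, 0x00BF, 0x2620, 0x1D11E};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp) + " -> " + utf8Length(cp) + " byte(s)");
        }
    }
}
```

Together with the non-BMP cases, the samples exercise every sequence length from one to four bytes.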
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/