Hi Marvin,

I'm guessing that since I'm the one that cares most about interoperability, I'll have to volunteer to do the heavy lifting. Tomorrow I'll go through and survey how many and which things would need to change to achieve full UTF-8 compliance. One concern is that I think in order to make that last case work, readChars() may need to return an array. Since readChars() is part of the public API and may be called by something other than readString(), I don't know if that'll fly.

I don't believe such a change would be required, since the ultimate data source/destination on the Java side will look the same (array of Java chars) - the only issue is how it looks when serialized.
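
To make the serialization point concrete, here is a rough sketch comparing the two byte-level forms of the same Java chars. It uses the JDK's DataOutputStream.writeUTF() as a stand-in for a "modified UTF-8" writer; that is an assumption for illustration, not Lucene's actual writeChars() code, and the class and method names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class SerializedForms {
    /** Standard UTF-8: a surrogate pair becomes one 4-byte sequence. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** JDK "modified UTF-8": each surrogate is written on its own (3 bytes each), U+0000 as C0 80. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = bytes.toByteArray();
        // writeUTF() prefixes a 2-byte length; drop it to compare payloads only.
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    public static void main(String[] args) {
        // U+1D11E is an arbitrary non-BMP code point (two Java chars).
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(standardUtf8(clef).length); // 4
        System.out.println(modifiedUtf8(clef).length); // 6
    }
}
```

Either way the caller hands in and gets back the same array of Java chars; only the bytes on disk differ.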

It seems clear that you have sufficient expertise to hone my rough contributions into final form. If you have the interest, would that be a good division of labor? I wish I could do this alone and just supply finished, tested patches, but obviously I can't. Or perhaps I'm underestimating your level of interest -- do you want to take the ball and run with it?

I can take a look at the code, sure. The hard part will be coding up the JUnit test cases (see below).

I think we could stand to have 2 corpuses of test documents available: one which is predominantly 2-byte and 3-byte UTF-8 (but no 4-byte), and another which has the full range including non-BMP code points. I can hunt those down or maybe get somebody from the Plucene community to create them, but perhaps they already exist?

Good test data for the decoder would be the following:

a. Single surrogate pair (two Java chars)
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.

Then all of the above, but remove the second (low-order) surrogate character (busted format).

Then all of the above, but replace the first (high-order) surrogate character.

Then all of the above, but replace the surrogate pair with an xC0 x80 encoded null byte.

And no, I don't think this test data exists, unfortunately. But it shouldn't be too hard to generate.
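
As a rough sketch of how the cases above could be generated as Java strings (which a JUnit test could then feed to the decoder), something like the following might do. The class name, the choice of U+1D11E, the "abc" filler, and the 'x' used to stand in for the replaced high surrogate are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical generator for the test strings; not existing Lucene code.
final class SurrogateTestData {
    // Any non-BMP code point works; U+1D11E is an arbitrary choice.
    static final String PAIR = new String(Character.toChars(0x1D11E));
    static final String HIGH = PAIR.substring(0, 1);
    static final String LOW = PAIR.substring(1);
    static final String REGULAR = "abc";

    /** Cases a-d: well-formed input. */
    static List<String> wellFormed() {
        List<String> cases = new ArrayList<>();
        cases.add(PAIR);            // a. single surrogate pair
        cases.add(PAIR + REGULAR);  // b. pair at the beginning
        cases.add(REGULAR + PAIR);  // c. pair at the end
        cases.add(PAIR + PAIR);     // d. two pairs in a row
        return cases;
    }

    /** Cases a-d plus the three damaged variants of each. */
    static List<String> allCases() {
        List<String> all = new ArrayList<>(wellFormed());
        for (String s : wellFormed()) {
            all.add(s.replace(PAIR, HIGH));      // low surrogate removed
        }
        for (String s : wellFormed()) {
            all.add(s.replace(PAIR, "x" + LOW)); // high surrogate replaced (assumed: by a plain char)
        }
        for (String s : wellFormed()) {
            all.add(s.replace(PAIR, "\0"));      // U+0000, which modified UTF-8 writes as C0 80
        }
        return all;
    }
}
```

That yields 16 strings in total: the four well-formed cases followed by the three damaged variants of each.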

-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200