Hi Marvin,
> I'm guessing that since I'm the one that cares most about
> interoperability, I'll have to volunteer to do the heavy lifting.
> Tomorrow I'll go through and survey how many and which things would
> need to change to achieve full UTF-8 compliance. One concern is
> that I think in order to make that last case work, readChars() may
> need to return an array. Since readChars() is part of the public
> API and may be called by something other than readString(), I don't
> know if that'll fly.
I don't believe such a change would be required, since the ultimate
data source/destination on the Java side will look the same (array of
Java chars) - the only issue is how it looks when serialized.
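As a rough illustration of that point, the sketch below serializes one and
the same Java char sequence two ways using only JDK classes: standard UTF-8
via String.getBytes(), and the JDK's modified UTF-8 via
DataOutputStream.writeUTF(). It does not call Lucene's readChars() or
readString(); the class name SerializedFormDemo and its helper methods are
made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SerializedFormDemo {
    /** Standard UTF-8: a supplementary code point becomes one 4-byte sequence. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** JDK modified UTF-8: each surrogate char is written as its own 3-byte sequence. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        // writeUTF() prefixes the data with a 2-byte length; strip it here.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\uD834\uDD1E"; // U+1D11E held as two Java chars
        System.out.println(hex(standardUtf8(s))); // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(s))); // ED A0 B4 ED B4 9E
    }
}
```

The in-memory String is identical in both calls; only the bytes written out
differ.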
> It seems clear that you have sufficient expertise to hone my rough
> contributions into final form. If you have the interest, would that
> be a good division of labor? I wish I could do this alone and just
> supply finished, tested patches, but obviously I can't. Or perhaps
> I'm underestimating your level of interest -- do you want to take
> the ball and run with it?
I can take a look at the code, sure. The hard part will be coding up
the JUnit test cases (see below).
> I think we could stand to have 2 corpuses of test documents
> available: one which is predominantly 2-byte and 3-byte UTF-8 (but
> no 4-byte), and another which has the full range including non-BMP
> code points. I can hunt those down or maybe get somebody from the
> Plucene community to create them, but perhaps they already exist?
Good test data for the decoder would be the following:
a. Single surrogate pair (two Java chars).
b. Surrogate pair at the beginning, followed by regular data.
c. Surrogate pair at the end, preceded by regular data.
d. Two surrogate pairs in a row.
Then all of the above, but remove the second (low-order) surrogate
character (busted format).
Then all of the above, but replace the first (high-order) surrogate character.
Then all of the above, but replace the surrogate pair with a null
character encoded as the two bytes 0xC0 0x80.
And no, I don't think this test data exists, unfortunately. But it
shouldn't be too hard to generate.
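A minimal sketch of how the char-level inputs for those cases might be
generated is below. The class and method names are hypothetical, "abc"
stands in for regular data, and U+1D11E is an arbitrary non-BMP code point.
The list above doesn't say what the high surrogate gets replaced with, so
the sketch assumes an ordinary 'X'. Turning these strings into bytes is left
to whichever encoder is under test.

```java
import java.util.ArrayList;
import java.util.List;

class SurrogateTestData {
    // U+1D11E as a high/low surrogate pair (arbitrary choice of non-BMP code point).
    static final String PAIR = "\uD834\uDD1E";
    // Hypothetical stand-in for "regular data".
    static final String REGULAR = "abc";

    /** Cases a through d: well-formed input. */
    static List<String> wellFormed() {
        List<String> cases = new ArrayList<>();
        cases.add(PAIR);            // a. single surrogate pair
        cases.add(PAIR + REGULAR);  // b. pair at the beginning
        cases.add(REGULAR + PAIR);  // c. pair at the end
        cases.add(PAIR + PAIR);     // d. two pairs in a row
        return cases;
    }

    /** Drops every low surrogate, leaving unpaired high surrogates. */
    static List<String> withoutLowSurrogates(List<String> cases) {
        List<String> out = new ArrayList<>();
        for (String s : cases) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                if (!Character.isLowSurrogate(c)) {
                    sb.append(c);
                }
            }
            out.add(sb.toString());
        }
        return out;
    }

    /** Replaces every high surrogate with 'X' (an assumption, see above). */
    static List<String> withHighSurrogatesReplaced(List<String> cases) {
        List<String> out = new ArrayList<>();
        for (String s : cases) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                sb.append(Character.isHighSurrogate(c) ? 'X' : c);
            }
            out.add(sb.toString());
        }
        return out;
    }

    /** Replaces each whole surrogate pair with a single U+0000 char. */
    static List<String> withPairsAsNull(List<String> cases) {
        List<String> out = new ArrayList<>();
        for (String s : cases) {
            out.add(s.replace(PAIR, "\0"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Print each malformed case as hex char codes for inspection.
        for (String s : withoutLowSurrogates(wellFormed())) {
            StringBuilder sb = new StringBuilder();
            for (char c : s.toCharArray()) {
                sb.append(String.format("%04X ", (int) c));
            }
            System.out.println(sb.toString().trim());
        }
    }
}
```

Each generator takes the well-formed list as input, so adding a fifth base
case automatically extends all three malformed groups.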
-- Ken
--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200