Hi,

The Unicode Standard added "Additional Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated it again in 6.0 with further "clarification" [2] regarding
how a "conformant" implementation should handle ill-formed UTF-8 byte
sequences. Basically it says

(1) the conversion process must not interpret any ill-formed code unit sequence
(2) such a process must not treat any adjacent well-formed code unit sequences
     as being part of those ill-formed code unit sequences
(3) and it recommends a "best practice" of using the "maximal valid subpart"
     as the unit of replacement (see the example just below)
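
To make (3) concrete (the expected result here is my reading of the recommendation, not a quote from the standard):

new String(new byte[]{(byte)0xe1, (byte)0x80, 'A'}, "UTF8")

should decode to "\ufffd\u0041": the bytes 0xe1 0x80 form the maximal valid subpart of an ill-formed three-byte sequence and are replaced as one unit, while the trailing 'A' is left alone.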

The new UTF-8 charset implementation we put into JDK7 (and have back-ported to previous releases since then) follows the new constraints in most cases, except for two corner cases (that we are aware of so far). 7082884 is one of them.

The current implementation decodes

new String(new byte[]{(byte)0xed, (byte)0x1f}, "UTF8")

into the single char "\ufffd",

while it should be "\ufffd\u001f" instead, according to the new constraints: the first byte 0xed is ill-formed here, and the following byte 0x1f is a well-formed single-byte sequence, so it must not be consumed, even though 0xed is the leading byte of a three-byte UTF-8
sequence.

The reason I call it a "corner case" is that the new UTF-8 implementation handles
this correctly in most cases. For example,

new String(new byte[]{(byte)0xed, (byte)0x1f, 'a'}, "UTF8");

does return the expected result "\ufffd\u001f\u0061".

The corner case here is that 0xed is the leading byte of a three-byte UTF-8 sequence, but we actually have only 2 bytes total in the pipe. The current UTF-8 decoder implementation will not even look into the following bytes when it has the leading byte of a 3-byte sequence but fewer than 3 bytes to work on; in this case it simply returns "underflow", meaning "I need more bytes". Unfortunately its upper level, CharsetDecoder, then treats this "underflow" status as a malformed byte sequence of length 2 (a reasonable decision on CharsetDecoder's part: the decoder did not consume these 2 bytes, no more bytes are coming, so whatever remains must be malformed).

The fix is to look further into the following bytes when we have a leading byte, even if we
don't have enough bytes to complete the conversion.
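
Roughly, the shape of the check is the following (a simplified sketch with a made-up helper name, not the actual webrev code):

import java.nio.ByteBuffer;
import java.nio.charset.CoderResult;

class Sketch {
    // invoked when b1 leads a 3-byte sequence but fewer than 3 bytes
    // remain in src: peek at what we do have before declaring underflow
    static CoderResult lookAhead3(int b1, ByteBuffer src) {
        if (src.hasRemaining()) {
            int b2 = src.get(src.position()) & 0xff; // peek, don't consume
            if ((b2 & 0xc0) != 0x80)                 // not a continuation byte
                return CoderResult.malformedForLength(1);
            // a complete fix also rejects the surrogate/overlong ranges
            // here, e.g. b1 == 0xed with b2 > 0x9f
        }
        return CoderResult.UNDERFLOW; // we genuinely need more bytes
    }
}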

The webrev is at

http://cr.openjdk.java.net/~sherman/7082884/webrev

Another corner case is how to deal with the old 5/6-byte forms, such as "fc 80 80 8f bf bf". We currently treat such a sequence as a single malformed UTF-8 byte sequence, so the whole "old form" 5/6-byte sequence is treated as one malformed unit and replaced by one "\ufffd". But according to the new "best practice" recommendation, it probably should be replaced by 6 "\ufffd" instead, if I understand the recommendation correctly (see the snippet below). Personally I feel the
existing implementation is the more reasonable approach. Opinions?
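
Concretely, the question is what

new String(new byte[]{(byte)0xfc, (byte)0x80, (byte)0x80,
                      (byte)0x8f, (byte)0xbf, (byte)0xbf}, "UTF8")

should return: today it is a single "\ufffd"; under my reading of the recommendation it would be six "\ufffd", one per byte, since 0xfc is no longer a defined lead byte at all.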

Thanks
-Sherman

[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://blogs.oracle.com/xuemingshen/entry/the_big_overhaul_of_java
