Hi,

The Unicode Standard added "Additional Constraints on conversion of ill-formed UTF-8" in version 5.1 [1] and updated it again in 6.0 with further "clarification" [2] regarding
how a "conformant" implementation should handle ill-formed UTF-8 byte
sequences. Basically it says

(1) the conversion process must not interpret any ill-formed code unit sequence
(2) such a process must not treat any adjacent well-formed code unit sequences
     as being part of those ill-formed code unit sequences
(3) and it recommends a "best practice" of using the "maximal valid subpart"
     as the unit of replacement (see the example just below)
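
To make (3) concrete (the expected result here is my reading of the recommendation, not a quote from the standard):

new String(new byte[]{(byte)0xe1, (byte)0x80, 'A'}, "UTF8")

should decode to "\ufffd\u0041": the bytes 0xe1 0x80 form the maximal valid subpart of an ill-formed three-byte sequence and are replaced as one unit, while the trailing 'A' is left alone.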

The new UTF-8 charset implementation we put into JDK7 (and have back-ported to previous releases since then) follows the new constraints in most cases, except for two corner cases (that we are aware of so far). 7082884 is one of them.

The current implementation decodes

new String(new byte[]{(byte)0xed, (byte)0x1f}, "UTF8")

into the single char "\ufffd",

while it should be "\ufffd\u001f" instead, according to the new constraints: the first byte 0xed is ill-formed here, and the following byte 0x1f is a well-formed single-byte sequence, so it must not be consumed, even though 0xed is the leading byte of a three-byte UTF-8
sequence.

The reason I call it a "corner case" is that the new UTF-8 implementation handles
this correctly in most cases. For example,

new String(new byte[]{(byte)0xed, (byte)0x1f, 'a'}, "UTF8");

does return the expected result "\ufffd\u001f\u0061".

The corner case here is that 0xed is the leading byte of a three-byte UTF-8 sequence, but we actually have only 2 bytes total in the pipe. The current UTF-8 decoder implementation will not even look into the following bytes when it has the leading byte of a 3-byte sequence but fewer than 3 bytes to work on; in this case it simply returns "underflow", meaning "I need more bytes". Unfortunately its upper level, CharsetDecoder, then treats this "underflow" status as a malformed byte sequence of length 2 (a reasonable decision on CharsetDecoder's part: the decoder did not consume these 2 bytes, no more bytes are coming, so whatever remains must be malformed).

The fix is to look further into the following bytes when we have a leading byte, even if we
don't have enough bytes to complete the conversion.
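
Roughly, the shape of the check is the following (a simplified sketch with a made-up helper name, not the actual webrev code):

import java.nio.ByteBuffer;
import java.nio.charset.CoderResult;

class Sketch {
    // invoked when b1 leads a 3-byte sequence but fewer than 3 bytes
    // remain in src: peek at what we do have before declaring underflow
    static CoderResult lookAhead3(int b1, ByteBuffer src) {
        if (src.hasRemaining()) {
            int b2 = src.get(src.position()) & 0xff; // peek, don't consume
            if ((b2 & 0xc0) != 0x80)                 // not a continuation byte
                return CoderResult.malformedForLength(1);
            // a complete fix also rejects the surrogate/overlong ranges
            // here, e.g. b1 == 0xed with b2 > 0x9f
        }
        return CoderResult.UNDERFLOW; // we genuinely need more bytes
    }
}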

The webrev is at

http://cr.openjdk.java.net/~sherman/7082884/webrev

Another corner case is how to deal with the old 5/6-byte forms, such as "fc 80 80 8f bf bf". We currently treat such a sequence as a single malformed UTF-8 byte sequence, so the whole "old form" 5/6-byte sequence is treated as one malformed unit and replaced by one "\ufffd". But according to the new "best practice" recommendation, it probably should be replaced by 6 "\ufffd" instead, if I understand the recommendation correctly (see the snippet below). Personally I feel the
existing implementation is the more reasonable approach. Opinions?
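
Concretely, the question is what

new String(new byte[]{(byte)0xfc, (byte)0x80, (byte)0x80,
                      (byte)0x8f, (byte)0xbf, (byte)0xbf}, "UTF8")

should return: today it is a single "\ufffd"; under my reading of the recommendation it would be six "\ufffd", one per byte, since 0xfc is no longer a defined lead byte at all.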

Thanks
-Sherman

[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://blogs.oracle.com/xuemingshen/entry/the_big_overhaul_of_java
