Codereview request for 7096080: UTF8 update and new CESU-8 charset

Xueming Shen Wed, 28 Sep 2011 12:17:54 -0700

Hi,

[I combined the proposed change for #7082884, in which no one appears to be
interested :-) into this one]

The Unicode Standard added "Additional Constraints on conversion of
ill-formed UTF-8" in version 5.1 [1] and updated it again in 6.0 with
further "clarification" [2] regarding how a "conformant" implementation
should handle ill-formed UTF-8 byte sequences. Basically it says

(1) the conversion process should not interpret any ill-formed code unit
    sequence
(2) such a process must not treat any adjacent well-formed code unit
    sequences as being part of those ill-formed code unit sequences
(3) and it recommends a "best practice" of "maximal valid sub-part" for
    replacement
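To make constraint (2) concrete, here is a minimal sketch that decodes a
truncated 3-byte sequence sandwiched between two ASCII letters. The class
and method names are made up for illustration, and the exact number of
U+FFFD characters produced depends on the JDK release in use. A decoder
following the "maximal valid sub-part" practice would be expected to emit a
single U+FFFD for the E2 82 pair and leave 'A' and 'B' untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the webrev.
final class IllFormedReplaceDemo {
    // new String(bytes, charset) replaces malformed input instead of throwing.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'A', then E2 82 (a 3-byte sequence missing its last byte), then 'B'.
        byte[] bytes = {0x41, (byte) 0xE2, (byte) 0x82, 0x42};
        String s = decodeLenient(bytes);
        // Print each decoded char as a code unit so replacements are visible.
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

The point to check in the output is that the well-formed neighbours survive
decoding regardless of how the ill-formed bytes in between are replaced.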

The new UTF-8 charset implementation we put in JDK7 (and back-ported to
previous releases since then) follows the new constraints in most cases,
except

(1) The decoder still accepts "historical" 3-byte surrogates and 6-byte
surrogate pairs (the encoder never outputs such sequences). The Unicode
Standard "tightened" its UTF-8 definition in ver 3.2 [3], as

     "Most notable among the corrigenda to the Standard is a further
     tightening of the definition of UTF-8, to eliminate irregular UTF-8
     and to bring the Unicode specification of UTF-8 more completely into
     line with other specifications of UTF-8."

So the 3-byte/6-byte surrogates are now defined as "ill-formed" code unit
sequences, instead of "irregular" [5] as in ver 3.1.
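One way to see whether a given runtime accepts the 3-byte surrogate form is
to decode with malformed input set to REPORT rather than REPLACE. The sketch
below uses only the standard java.nio.charset API; the class and method
names are illustrative. A decoder that still accepts the historical form, as
described above for JDK7, would return true for the surrogate bytes, while a
strict one would return false.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the webrev.
final class StrictUtf8Check {
    // Returns true only if the decoder accepts every byte without complaint.
    static boolean isWellFormed(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // ED A0 81 ED B0 80: U+10400 written as a 6-byte surrogate pair.
        byte[] surrogatePair = {(byte) 0xED, (byte) 0xA0, (byte) 0x81,
                                (byte) 0xED, (byte) 0xB0, (byte) 0x80};
        // F0 90 90 80: the same code point in standard 4-byte UTF-8.
        byte[] fourByte = {(byte) 0xF0, (byte) 0x90, (byte) 0x90, (byte) 0x80};
        System.out.println("6-byte surrogate pair accepted: " + isWellFormed(surrogatePair));
        System.out.println("4-byte form accepted:           " + isWellFormed(fourByte));
    }
}
```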

(2) While no longer accepting the "historical" 5-byte and 6-byte UTF-8 byte
sequences, the decoder treats each of these 5/6-byte sequences as ONE
malformed unit. As a result these bytes get replaced by one replacement
character when "replace for malformed" is desired (as in
new String(bytes), for example). According to the latest Unicode Standard
[2], however, because the leading byte of these 5/6-byte sequences is no
longer legal in UTF-8, these bytes should be treated as 5-6 individual
ill-formed bytes.
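The difference is easy to observe by counting replacement characters. The
following is a rough sketch with made-up names: it feeds a historical
5-byte sequence through new String(bytes, charset) and counts the U+FFFD
characters in the result. With the behaviour described above for JDK7 the
count would be 1; a decoder that treats each byte as individually
ill-formed would give 5.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the webrev.
final class ReplacementCount {
    // Counts U+FFFD characters produced when decoding with replacement.
    static int countReplacements(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\uFFFD') {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // F8 88 80 80 80: a "historical" 5-byte form, no longer valid UTF-8.
        byte[] fiveByte = {(byte) 0xF8, (byte) 0x88, (byte) 0x80,
                           (byte) 0x80, (byte) 0x80};
        System.out.println("replacement chars: " + countReplacements(fiveByte));
    }
}
```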

(3) Corner cases like the ill-formed byte sequence ED 31 are not handled
correctly/consistently, as described in #7082884 [6]
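A quick way to probe the consistency part is to decode the same bytes
through two public entry points and compare the results. This is only a
sketch with illustrative names, and it does not claim to reproduce the
exact scenario reported in [6]; it simply checks that
new String(bytes, charset) and a CharsetDecoder configured to REPLACE agree
on ED 31, where the expected result under the constraints above is one
U+FFFD followed by the digit '1'.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the webrev.
final class Ed31Consistency {
    static String viaString(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    static String viaDecoder(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE; rethrow unchecked to keep callers simple.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // ED starts a 3-byte sequence, but 31 ('1') is not a continuation byte.
        byte[] bytes = {(byte) 0xED, 0x31};
        String a = viaString(bytes);
        String b = viaDecoder(bytes);
        System.out.println("same result from both paths: " + a.equals(b));
        System.out.println("trailing '1' preserved:      " + a.endsWith("1"));
    }
}
```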

The reason behind (1) and (2) is mostly the compatibility concern. As
suggested in TR#26 [4] (which defines CESU-8, a separate UTF encoding
scheme that uses 3-6-byte sequences for supplementary characters, instead
of the 4-byte sequences in UTF-8), there are apps/data out there that do
use surrogate pairs in "UTF-8" form. Changing the UTF-8 charset to follow
the standard will obviously break someone's code when they migrate/upgrade
from JDK/JRE N to N+1, something we try really hard to avoid.

That said, given that almost a decade has passed and we are now at Unicode
6, I think the possibility of breaking someone's code/data by upgrading
UTF-8 to do the "right thing" is small/minor. So I propose here

(1) to upgrade the JDK8 UTF-8 implementation to strictly follow the
    standard, to
     a) reject 3-byte surrogates/6-byte surrogate pairs
     b) treat 5/6-byte sequences as individual ill-formed bytes
     c) fix the corner case bug #7082884
(2) to add a CESU-8 charset to the JDK/JRE's charset repository (for those
    who still prefer/work on 3-6-byte surrogates, in "UTF-8" form)
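For readers unfamiliar with CESU-8, the encoding rule from TR#26 [4] can be
sketched in a few lines: every UTF-16 code unit, surrogates included, is
written with the ordinary 1-3 byte UTF-8 bit layout, so a supplementary
character comes out as 6 bytes instead of 4. The class below is a
hand-written illustration of that rule, not the charset implementation in
the webrev, and its names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the CESU-8 encoding rule.
final class Cesu8Sketch {
    // Encodes each UTF-16 code unit independently using the 1-3 byte layout.
    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                out.write(c);
            } else if (c < 0x800) {
                out.write(0xC0 | (c >> 6));
                out.write(0x80 | (c & 0x3F));
            } else {
                // Surrogates D800..DFFF fall through here as well.
                out.write(0xE0 | (c >> 12));
                out.write(0x80 | ((c >> 6) & 0x3F));
                out.write(0x80 | (c & 0x3F));
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x10400));
        StringBuilder hex = new StringBuilder();
        for (byte b : encode(s)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println("CESU-8: " + hex.toString().trim());
        System.out.println("UTF-8 length: " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For U+10400 (surrogates D801 DC00) this yields ED A0 81 ED B0 80, which is
exactly the kind of 6-byte surrogate pair that a strict UTF-8 decoder would
reject under proposal (1).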

Here is the webrev. The change will need to go through some "incompatible
change" review process, but I think we can start the code review/discussion
here first.

http://cr.openjdk.java.net/~sherman/7096080/webrev/


-Sherman

[1] http://www.unicode.org/versions/Unicode5.1.0/#Notable_Changes
[2] http://www.unicode.org/versions/Unicode6.0.0/#Conformance_Changes
[3] http://www.unicode.org/reports/tr28/tr28-3.html
[4] http://unicode.org/reports/tr26/
[5] http://unicode.org/versions/corrigendum1.html
[6] http://mail.openjdk.java.net/pipermail/core-libs-dev/2011-September/007722.html
