This response confuses me.  Are you saying that the UTF-8 encoder is not really 
producing UTF-8?  RFC 2279 and RFC 3629 both clearly state that a surrogate pair 
must be combined into a single code point, which is then encoded as one 4-byte 
sequence.  In fact, the RFCs refer to the alternate encoding CESU-8, which instead 
encodes each half of the surrogate pair as its own 3-byte sequence.
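For illustration, a minimal sketch of the difference being described (U+10400 is an arbitrary supplementary character chosen for the example; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class SurrogateUtf8Demo {
    public static void main(String[] args) {
        // U+10400 is a supplementary code point; in a Java String it is
        // stored as the surrogate pair \uD801\uDC00 (two 16-bit chars).
        String s = "\uD801\uDC00";

        // A conforming UTF-8 encoder combines the pair into one code point
        // and emits a single 4-byte sequence: F0 90 90 80.
        // CESU-8 would instead emit two 3-byte sequences, one per surrogate.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b);   // prints: F0 90 90 80
        }
        System.out.println();
    }
}
```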

I guess returning 3.0 for maxBytesPerChar works for the purpose of allocating a 
big enough byte buffer, but the explanation in this thread is confusing.

Tom Salter

------------------------------
Date: Tue, 23 Sep 2014 11:37:07 +0400
From: Ivan Gerasimov <ivan.gerasi...@oracle.com>
To: Xueming Shen <xueming.s...@oracle.com>,     Martin Buchholz
        <marti...@google.com>
Cc: nio-...@openjdk.java.net, core-libs-dev
        <core-libs-dev@openjdk.java.net>
Subject: Re: RFR [8058875]: CharsetEncoder.maxBytesPerChar() should
        return  4 for UTF-8
Message-ID: <54212323.5080...@oracle.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Martin, Sherman thanks for clarification!

Closing the bug as not a bug.

> The "character" in the nio Charset and CharsetDe/Encoder is specified as a 
> "sixteen-bit Unicode code unit", so it is reasonable to interpret the 
> "character" in "maximum number of bytes that will be produced for each 
> character of input" as the Java "char" as well. In the case of UTF-8, each 
> supplementary character (a 4-byte form in UTF-8) is always coded as 2 
> surrogate chars, so it's "2 bytes per char".
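The arithmetic above can be checked with a small sketch (U+10400 is an arbitrary supplementary character used for illustration; the class name is mine):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // "char" here means a 16-bit Java char (UTF-16 code unit), not a
        // code point. A single BMP char can need up to 3 bytes in UTF-8,
        // hence the 3.0 bound discussed in this thread.
        System.out.println("maxBytesPerChar = " + enc.maxBytesPerChar());

        // A supplementary character occupies 2 chars (a surrogate pair) in
        // a Java String but only 4 bytes in UTF-8: 2 bytes per char, so it
        // never exceeds the 3.0-per-char bound.
        String pair = "\uD801\uDC00";   // U+10400 as a surrogate pair
        int byteCount = pair.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(byteCount + " bytes for " + pair.length() + " chars");
    }
}
```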

> Do we have a real escalation that complains about this?
>
Yes, the link is on the bug page: 
https://bugs.openjdk.java.net/browse/JDK-8058875
I'm going to try to explain what I've just realized about this function :-)

Sincerely yours,
Ivan
