This response confuses me.  Are you saying that the UTF-8 encoder is not really 
producing UTF-8?  RFC 2279 and RFC 3629 both clearly state that a surrogate pair 
must be combined into a single code point, which is then encoded as one 4-byte 
sequence.  In fact, the RFCs refer to the alternate encoding CESU-8, which instead 
encodes each half of the surrogate pair as its own 3-byte sequence.
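For illustration, a minimal sketch of the difference being described (U+10400 is an arbitrary supplementary character chosen for the example; the class name is mine):

```java
import java.nio.charset.StandardCharsets;

public class SurrogateUtf8Demo {
    public static void main(String[] args) {
        // U+10400 is a supplementary code point; in a Java String it is
        // stored as the surrogate pair \uD801\uDC00 (two 16-bit chars).
        String s = "\uD801\uDC00";

        // A conforming UTF-8 encoder combines the pair into one code point
        // and emits a single 4-byte sequence: F0 90 90 80.
        // CESU-8 would instead emit two 3-byte sequences, one per surrogate.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (byte b : utf8) {
            System.out.printf("%02X ", b);   // prints: F0 90 90 80
        }
        System.out.println();
    }
}
```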

I guess returning 3.0 for maxBytesPerChar works for the purpose of allocating a 
big enough byte buffer, but the explanation in this thread is confusing.

Tom Salter

------------------------------
Date: Tue, 23 Sep 2014 11:37:07 +0400
From: Ivan Gerasimov <ivan.gerasi...@oracle.com>
To: Xueming Shen <xueming.s...@oracle.com>,     Martin Buchholz
        <marti...@google.com>
Cc: nio-...@openjdk.java.net, core-libs-dev
        <core-libs-dev@openjdk.java.net>
Subject: Re: RFR [8058875]: CharsetEncoder.maxBytesPerChar() should
        return  4 for UTF-8
Message-ID: <54212323.5080...@oracle.com>
Content-Type: text/plain; charset=UTF-8; format=flowed

Martin, Sherman thanks for clarification!

Closing the bug as not a bug.

> The "character" in the nio Charset and CharsetDe/Encoder is specified as a 
> "sixteen-bit Unicode code unit", so it is reasonable to interpret the 
> "character" in "maximum number of bytes that will be produced for each 
> character of input" as the Java "char" as well. In the case of UTF-8, each 
> supplementary character (a 4-byte form in UTF-8) is always coded as 2 
> surrogate chars, so it's "2 bytes per char".
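The arithmetic above can be checked with a small sketch (U+10400 is an arbitrary supplementary character used for illustration; the class name is mine):

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class MaxBytesPerCharDemo {
    public static void main(String[] args) {
        CharsetEncoder enc = StandardCharsets.UTF_8.newEncoder();
        // "char" here means a 16-bit Java char (UTF-16 code unit), not a
        // code point. A single BMP char can need up to 3 bytes in UTF-8,
        // hence the 3.0 bound discussed in this thread.
        System.out.println("maxBytesPerChar = " + enc.maxBytesPerChar());

        // A supplementary character occupies 2 chars (a surrogate pair) in
        // a Java String but only 4 bytes in UTF-8: 2 bytes per char, so it
        // never exceeds the 3.0-per-char bound.
        String pair = "\uD801\uDC00";   // U+10400 as a surrogate pair
        int byteCount = pair.getBytes(StandardCharsets.UTF_8).length;
        System.out.println(byteCount + " bytes for " + pair.length() + " chars");
    }
}
```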

> Do we have a real escalation that complains about this?
>
Yes, the link is on the bug page: 
https://bugs.openjdk.java.net/browse/JDK-8058875
I'm going to try to explain what I've just realized about this function :-)

Sincerely yours,
Ivan
