Re: Different error decoding Shift-JIS sequence in JDK8

Xueming Shen Fri, 29 Nov 2013 11:23:24 -0800

Hi Charles,

My apologies for the late response. I was on vacation this past week and did not have full email access.

As Sean pointed out, this is triggered by the change we recently put in for 8008386, which tried to address a strong request that a case like 'fe' '40' be treated as one malformed byte plus a mappable ASCII 40. The reasoning appears to be that in a case like this the decoder should assume the first byte, "fe", was corrupted in transmission; treating the two bytes as a pair causes valuable information, the next byte, to be dropped. And this was a regression in jdk6 (from jdk5).

As a matter of fact, the reason we made the change in jdk6 was a case similar to your use scenario :-( So it appears we are between a rock and a hard place...

That said, I have to admit that in the case of fe 40, it might be more reasonable to treat it as an unmappable 2-byte sequence, instead of a malformed leading byte followed by a mappable ASCII.

I need to take a little more time to review the whole situation and see if we can find some compromise here.

Btw, it would be helpful if you could provide a few more details regarding your use scenario, as you mentioned in your email.

"We use this mechanism in JRuby for transcoding arbitrary byte[] from one
encoding to another."

Thanks!
-Sherman

On 11/28/13 1:31 AM, Charles Oliver Nutter wrote:

What incantation is required to get Sherman to offer his perspective? :-)

I accept that it may be after Thanksgiving, but I need to know the
situation since we have tests for JRuby that depended on this
character being valid Shift-JIS.

- Charlie

On Mon, Nov 25, 2013 at 4:08 AM, Seán Coffey wrote:

Sherman can answer this best. The 8008386 fix for 8 differs from earlier
updates since a lot of the code was rewritten in this area. The initial
report was identified as a regression in JDK6. Back in 2005, the 6227339 fix
changed behaviour which meant that invalid single byte characters were
treated incorrectly when decoding Shift_JIS encoded bytes. It meant that two
bytes are decoded to a "?" character rather than one. The valid single byte
characters are lost as a result and I believe this was all unintended when
the 6227339 fix was made.

Changes made in 8008386 mean that the case of a malformed character (legal
leading byte) followed by a valid single byte should now return a
replacement character for the first malformed byte and a correctly decoded
single byte character.
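
For illustration, a minimal sketch of how the two behaviours described above could be observed from user code, assuming the decoder is configured to replace bad input rather than report it. The class and method names (`ReplaceDemo`, `decodeReplacing`) are hypothetical and not part of the JDK. The Shift_JIS output is deliberately not stated here, since per this thread it depends on the JDK release.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

class ReplaceDemo {
    // Hypothetical helper: decodes the input, substituting the decoder's
    // replacement string (U+FFFD by default) for anything it reports as
    // malformed or unmappable.
    static String decodeReplacing(String charsetName, byte[] input) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(input))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = decodeReplacing("Shift_JIS", new byte[] {(byte) 0xEF, 0x40});
        // Print each char as a code unit so that "one replacement char" can be
        // told apart from "replacement char followed by U+0040".
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```

If the two bytes are consumed as one bad pair, a single replacement character comes back; if only the lead byte is rejected, the replacement character is followed by the decoded 0x40.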

regards,
Sean.


On 22/11/2013 13:20, Alan Bateman wrote:

On 22/11/2013 11:02, Charles Oliver Nutter wrote:

Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.

I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.

I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean
will jump in to explain this (which seems to be related to a long standing
regression).

-Alan

[1] https://bugs.openjdk.java.net/browse/JDK-8008386

Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.

I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.

The code to reproduce is fairly simple:

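import java.nio.*;
import java.nio.charset.*;

byte[] bytes = {(byte)0xEF, 0x40};
CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
// decode(in, out, endOfInput) returns a CoderResult; false = more input may follow
System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
        CharBuffer.allocate(2), false));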

Note that this is pumping the decoder directly and specifying partial
input (false). We use this mechanism in JRuby for transcoding
arbitrary byte[] from one encoding to another.
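
As a rough sketch of what "pumping the decoder directly" can look like, the following feeds input to a `CharsetDecoder` in chunks, passing `endOfInput=false` until the last chunk, and then re-encodes the decoded text. This is not JRuby's actual transcoder; the class name `TranscodeSketch`, the method `transcode`, and the `chunkSize` parameter are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class TranscodeSketch {
    // Hypothetical helper: decodes `input` in pieces of at most `chunkSize`
    // bytes, then encodes the resulting text in the target charset.
    static byte[] transcode(byte[] input, Charset from, Charset to, int chunkSize)
            throws CharacterCodingException {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be >= 1");
        }
        CharsetDecoder decoder = from.newDecoder();
        ByteBuffer in = ByteBuffer.allocate(input.length);
        // Large enough that the decode loop never needs to handle OVERFLOW.
        CharBuffer chars = CharBuffer.allocate(
                (int) Math.ceil(input.length * (double) decoder.maxCharsPerByte()) + 1);

        int offset = 0;
        while (true) {
            int n = Math.min(chunkSize, input.length - offset);
            in.put(input, offset, n);
            offset += n;
            boolean last = (offset == input.length);

            in.flip();
            CoderResult cr = decoder.decode(in, chars, last);
            if (cr.isError()) {
                cr.throwException(); // MALFORMED[n] or UNMAPPABLE[n]
            }
            // Keep any undecoded trailing bytes (e.g. a lone lead byte)
            // so they are retried together with the next chunk.
            in.compact();
            if (last) {
                break;
            }
        }
        decoder.flush(chars);
        chars.flip();

        ByteBuffer encoded = to.newEncoder().encode(chars);
        byte[] bytes = new byte[encoded.remaining()];
        encoded.get(bytes);
        return bytes;
    }
}
```

With `endOfInput=false`, a trailing lead byte is left unconsumed and the decoder reports underflow, so the malformed/unmappable distinction discussed in this thread only arises once the decoder has seen enough bytes to decide.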

The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
on JDK8 is "MALFORMED[1]".

Information online is spotty as to whether this sequence is valid. It
does appear on the table for [JIS X
0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
articles on Shift-JIS claim that it is at worst undefined and at best
valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
decoder.

Note that on JDK7 it is "unmappable", which may mean this code
represents a character with no equivalent in Unicode.
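
To make the malformed/unmappable distinction concrete, here is a small sketch that runs a single `decode` call and reports how the decoder classified the input, using only the `CoderResult` query methods. The names `CoderResultProbe` and `probe` are hypothetical. No output is claimed for the Shift-JIS case, since the point of this thread is that it differs between JDK7 and JDK8.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

class CoderResultProbe {
    // Hypothetical helper: one decode call with endOfInput=false,
    // summarised as a short description of the CoderResult.
    static String probe(String charsetName, byte[] input) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder();
        int capacity =
                (int) Math.ceil(input.length * (double) decoder.maxCharsPerByte()) + 1;
        CoderResult r = decoder.decode(
                ByteBuffer.wrap(input), CharBuffer.allocate(capacity), false);

        if (r.isMalformed()) {
            // The bytes are not a legal sequence in this encoding.
            return "malformed, length " + r.length();
        }
        if (r.isUnmappable()) {
            // The bytes are legal, but have no Unicode mapping in this charset.
            return "unmappable, length " + r.length();
        }
        return r.isOverflow() ? "overflow" : "underflow";
    }

    public static void main(String[] args) {
        System.out.println(probe("Shift_JIS", new byte[] {(byte) 0xEF, 0x40}));
    }
}
```

The length reported alongside the classification is the number of input bytes the decoder considers bad, which is why the two results in this thread differ in both kind and length.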

I have uploaded my code to github here:
https://github.com/headius/jdk8_utf8_decoding_bug

Thoughts?

- Charlie
