Re: Different error decoding Shift-JIS sequence in JDK8

Seán Coffey Mon, 25 Nov 2013 02:10:08 -0800

Sherman can answer this best. The 8008386 fix for 8 differs from earlier updates since a lot of the code was rewritten in this area. The initial report was identified as a regression in JDK6. Back in 2005, the 6227339 fix changed behaviour which meant that invalid single byte characters were treated incorrectly when decoding Shift_JIS encoded bytes. It meant that two bytes are decoded to a "?" character rather than one. The valid single byte characters are lost as a result and I believe this was all unintended when the 6227339 fix was made.

Changes made in 8008386 mean that the case of a malformed character (legal leading byte) followed by a valid single byte should now return a replacement character for the first malformed byte and a correctly decoded single byte character.
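
To make that case concrete, here is a minimal sketch, not taken from the fix or its tests. It assumes the input {0x81, 0x30}: 0x81 is a legal Shift_JIS lead byte, while 0x30 ('0') is a valid single byte that is not a legal trail byte. The class and method names are made up for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class ReplacementSketch {
    // Decodes Shift_JIS bytes, substituting the decoder's default replacement
    // (U+FFFD) for malformed or unmappable input instead of failing.
    static String decodeWithReplacement(byte[] bytes) {
        try {
            return Charset.forName("Shift_JIS")
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed input: legal lead byte 0x81, then '0' (0x30), not a trail byte.
        byte[] bytes = {(byte) 0x81, 0x30};
        for (char c : decodeWithReplacement(bytes).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

On a JDK that contains the 8008386 change, the behaviour described above would mean two lines of output: the replacement character for the lead byte, then the '0'. On releases with the older behaviour, the '0' would be expected to be swallowed.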


regards,
Sean.

On 22/11/2013 13:20, Alan Bateman wrote:

On 22/11/2013 11:02, Charles Oliver Nutter wrote:

Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.

I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.

I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean will jump in to explain this (which seems to be related to a long-standing regression).


-Alan

[1] https://bugs.openjdk.java.net/browse/JDK-8008386

Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.

I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.

The code to reproduce is fairly simple:

import java.nio.*;
import java.nio.charset.*;

byte[] bytes = {(byte)0xEF, 0x40};
CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
        CharBuffer.allocate(2), false));

Note that this is pumping the decoder directly and specifying partial
input (false). We use this mechanism in JRuby for transcoding
arbitrary byte[] from one encoding to another.
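
For readers unfamiliar with that mode, the following is a rough sketch of what pumping a decoder with partial input can look like. It is not JRuby's code; the class name, the chunking, and the REPLACE error actions are assumptions made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class PumpSketch {
    // Feeds chunks to a single decoder, passing endOfInput=false until the
    // final call so that a sequence split across chunks is not an error.
    static String pump(Charset charset, byte[][] chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer in = ByteBuffer.allocate(total);
        // Sized generously so this sketch never has to handle OVERFLOW.
        CharBuffer out = CharBuffer.allocate(
                (int) Math.ceil(total * decoder.maxCharsPerByte()) + 1);

        for (byte[] chunk : chunks) {
            in.put(chunk);
            in.flip();
            decoder.decode(in, out, false); // more input may follow
            in.compact();                   // keep any incomplete trailing bytes
        }
        in.flip();
        decoder.decode(in, out, true);      // no more input
        decoder.flush(out);
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // 0x82 0xA0 is hiragana "a" (U+3042) in Shift_JIS, split across chunks.
        byte[][] chunks = {{(byte) 0x82}, {(byte) 0xA0, 0x41}};
        for (char c : pump(Charset.forName("Shift_JIS"), chunks).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The point of `false` is that a lead byte at the end of a chunk is reported as underflow and left in the buffer, instead of being treated as malformed.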

The result of running the {0xEF, 0x40} reproduction on JDK7 is "UNMAPPABLE[2]"
while the result on JDK8 is "MALFORMED[1]".
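
The bracketed number in those results is the length, in bytes, of the input the decoder is complaining about. A small sketch that wraps the same call in a class and prints the pieces of the CoderResult separately may make a cross-version comparison easier; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

public class ShiftJisRepro {
    // Runs one decode call with endOfInput=false and returns the raw result.
    static CoderResult decodeOnce(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        return decoder.decode(ByteBuffer.wrap(bytes), CharBuffer.allocate(2), false);
    }

    public static void main(String[] args) {
        CoderResult result = decodeOnce(new byte[] {(byte) 0xEF, 0x40});
        System.out.println(System.getProperty("java.version") + ": " + result);
        if (result.isError()) {
            // length() is only defined for malformed/unmappable results.
            System.out.println("malformed=" + result.isMalformed()
                    + " unmappable=" + result.isUnmappable()
                    + " length=" + result.length());
        }
    }
}
```

Running this on each JDK of interest prints the version next to the result, so the two behaviours can be compared side by side.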

Information online is spotty as to whether this sequence is valid. It
does appear on the table for [JIS X
0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
articles on Shift-JIS claim that it is at worst undefined and at best
valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
decoder.

Note that on JDK7 it is "unmappable", which may mean this code
represents a character with no equivalent in Unicode.

I have uploaded my code to github here:
https://github.com/headius/jdk8_utf8_decoding_bug

Thoughts?

- Charlie
