What incantation is required to get Sherman to offer his perspective? :-) I accept that it may be after Thanksgiving, but I need to know the situation since we have tests for JRuby that depended on this character being valid Shift-JIS.
- Charlie

On Mon, Nov 25, 2013 at 4:08 AM, Seán Coffey <sean.cof...@oracle.com> wrote:
> Sherman can answer this best. The 8008386 fix for 8 differs from earlier
> updates since a lot of the code was rewritten in this area. The initial
> report was identified as a regression in JDK6. Back in 2005, the 6227339 fix
> changed behaviour which meant that invalid single byte characters were
> treated incorrectly when decoding Shift_JIS encoded bytes. It meant that two
> bytes are decoded to a "?" character rather than one. The valid single byte
> characters are lost as a result and I believe this was all unintended when
> the 6227339 fix was made.
>
> Changes made in 8008386 mean that the case of a malformed character (legal
> leading byte) followed by a valid single byte should now return a
> replacement character for the first malformed byte and a correctly decoded
> single byte character.
>
> regards,
> Sean.
>
>
> On 22/11/2013 13:20, Alan Bateman wrote:
>>
>> On 22/11/2013 11:02, Charles Oliver Nutter wrote:
>>>
>>> Apologies if this is not the correct place to post this, but i18n
>>> seemed more focused on languages and localization than the mechanics
>>> of transcoding.
>>>
>>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>>> input for what should be a valid Shift-JIS sequence, where JDK7
>>> reported that the character was unmappable.
>>
>> I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean
>> will jump in to explain this (which seems to be related to a long standing
>> regression).
>>
>> -Alan
>>
>> [1] https://bugs.openjdk.java.net/browse/JDK-8008386
>
>
>> Apologies if this is not the correct place to post this, but i18n
>> seemed more focused on languages and localization than the mechanics
>> of transcoding.
>>
>> I have noticed a behavioral difference in JDK8 decoding a two-byte
>> Shift-JIS sequence. Specifically, JDK8 appears to report malformed
>> input for what should be a valid Shift-JIS sequence, where JDK7
>> reported that the character was unmappable.
>>
>> The code to reproduce is fairly simple:
>>
>>     byte[] bytes = {(byte)0xEF, 0x40};
>>     CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
>>     System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
>>             CharBuffer.allocate(2), false));
>>
>> Note that this is pumping the decoder directly and specifying partial
>> input (false). We use this mechanism in JRuby for transcoding
>> arbitrary byte[] from one encoding to another.
>>
>> The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
>> on JDK8 is "MALFORMED[1]".
>>
>> Information online is spotty as to whether this sequence is valid. It
>> does appear on the table for [JIS X
>> 0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and several
>> articles on Shift-JIS claim that it is at worst undefined and at best
>> valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
>> decoder.
>>
>> Note that on JDK7 it is "unmappable", which may mean this code
>> represents a character with no equivalent in Unicode.
>>
>> I have uploaded my code to github here:
>> https://github.com/headius/jdk8_utf8_decoding_bug
>>
>> Thoughts?
>>
>> - Charlie
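
For anyone who wants to try this locally, here is a minimal, self-contained sketch of the reproduction quoted above, extended with a second check for the replacement behaviour Sean describes. The class name `ShiftJisDecodeProbe` and the helper names `probe` and `decodeWithReplacement` are made up for illustration and are not part of JRuby or the JDK. The output depends on the JDK in use: the thread reports "UNMAPPABLE[2]" on JDK7 and "MALFORMED[1]" on JDK8 for the first check, so treat whatever your runtime prints as an observation rather than a reference result.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, for illustration only.
class ShiftJisDecodeProbe {

    // Pump the decoder directly, as in the original report.
    // endOfInput=false tells the decoder that more bytes may follow.
    // A new decoder reports malformed and unmappable input by default,
    // so problems come back as a CoderResult instead of being replaced.
    public static CoderResult probe(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        return decoder.decode(ByteBuffer.wrap(bytes),
                CharBuffer.allocate(Math.max(2, bytes.length)), false);
    }

    // Decode with replacement enabled, to see how many input bytes
    // end up covered by a single replacement character.
    public static String decodeWithReplacement(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xEF, 0x40};

        System.out.println(System.getProperty("java.version")
                + " -> " + probe(bytes));

        // Print code points so the replacement character is visible
        // regardless of the console encoding.
        StringBuilder sb = new StringBuilder();
        for (char c : decodeWithReplacement(bytes).toCharArray()) {
            sb.append(String.format("U+%04X ", (int) c));
        }
        System.out.println("with REPLACE: " + sb.toString().trim());
    }
}
```

If the second line shows a replacement character followed by U+0040, the runtime is keeping the trailing single byte as Sean describes for 8008386; a lone replacement character would match the older behaviour where both bytes are consumed.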