Hi Charles,
My apologies for the late response. I was on vacation the past week and
did not have full email access.
As Sean pointed out, this is triggered by the change we recently put in
for 8008386, in which we tried to address a strong request that a case
like 'fe' '40' be treated as 1 malformed byte + a mappable ascii 40. The
reasoning appears to be that in a case like this, the decoder should
assume the first byte "fe" was corrupted during communication; treating
the two bytes as a pair causes valuable information, the next byte, to
be dropped. And this was a regression in jdk6 (from jdk5).
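To make the trade-off concrete, here is a rough sketch of a caller that pumps the decoder by hand and, on any error, substitutes '?' and skips exactly the number of bytes the decoder reports. The class and method names (`SjisPump`, `pump`) are made up for illustration and are not part of the JDK or JRuby. With this kind of caller, a result of length 2 swallows the second byte, while a result of length 1 lets the second byte be decoded on the next pass.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical helper for illustration only.
class SjisPump {
    // Decodes Shift_JIS bytes, replacing each reported error with '?'
    // and skipping exactly result.length() input bytes.
    static String pump(byte[] bytes) {
        // A new decoder reports errors (CodingErrorAction.REPORT) by default.
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        // Every decoded char or '?' consumes at least one input byte,
        // so bytes.length + 1 chars is always enough room.
        CharBuffer out = CharBuffer.allocate(bytes.length + 1);
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (!result.isError()) {
                break; // underflow: all input has been consumed
            }
            out.put('?');
            // The input position sits on the first offending byte;
            // step over as many bytes as the decoder blamed.
            in.position(in.position() + result.length());
        }
        out.flip();
        return out.toString();
    }

    public static void main(String[] args) {
        // The two bytes from the original report in this thread.
        System.out.println(pump(new byte[] {(byte) 0xEF, 0x40}));
    }
}
```

Going by the results reported in this thread, the same input would come out as a single '?' when the decoder answers UNMAPPABLE[2] and as '?' followed by '@' when it answers MALFORMED[1].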
As a matter of fact, the reason we made the change in jdk6 was a case
similar to your use scenario :-( So it appears we are between a rock and
a hard place...
That said, I have to admit that in the case of fe 40, it might be more
reasonable to treat it as an unmappable 2-byte sequence, instead of a
malformed leading byte followed by a mappable ascii.
I need to take a little more time to review the whole situation and see
if we can reach some compromise here.
Btw, it would be helpful if you could provide a little more detail
regarding your use scenario, as you mentioned in your email:
"We use this mechanism in JRuby for transcoding arbitrary byte[] from one
encoding to another."
Thanks!
-Sherman
On 11/28/13 1:31 AM, Charles Oliver Nutter wrote:
What incantation is required to get Sherman to offer his perspective? :-)
I accept that it may be after Thanksgiving, but I need to know the
situation since we have tests for JRuby that depended on this
character being valid Shift-JIS.
- Charlie
On Mon, Nov 25, 2013 at 4:08 AM, Seán Coffey <sean.cof...@oracle.com> wrote:
Sherman can answer this best. The 8008386 fix for 8 differs from earlier
updates since a lot of the code was rewritten in this area. The initial
report was identified as a regression in JDK6. Back in 2005, the 6227339 fix
changed behaviour which meant that invalid single byte characters were
treated incorrectly when decoding Shift_JIS encoded bytes. It meant that two
bytes are decoded to a "?" character rather than one. The valid single byte
characters are lost as a result and I believe this was all unintended when
the 6227339 fix was made.
Changes made in 8008386 mean that the case of a malformed character (legal
leading byte) followed by a valid single byte should now return a
replacement character for the first malformed byte and a correctly decoded
single byte character.
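As a rough illustration of the behaviour described above, the sketch below decodes with both error actions set to REPLACE and returns the resulting string. The class and method names (`SjisReplace`, `decodeWithReplacement`) are hypothetical. What comes back depends on the JDK in use: one replacement character for the whole pair, or a replacement character followed by the decoded second byte.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper for illustration only.
class SjisReplace {
    // Decodes Shift_JIS bytes, substituting the decoder's replacement
    // string (U+FFFD by default) for malformed or unmappable input.
    static String decodeWithReplacement(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift_JIS").newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String s = decodeWithReplacement(new byte[] {(byte) 0xEF, 0x40});
        // Print the code points so the replacement character is visible.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```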
regards,
Sean.
On 22/11/2013 13:20, Alan Bateman wrote:
On 22/11/2013 11:02, Charles Oliver Nutter wrote:
Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.
I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.
I assume this is related to JDK-8008386 [1] and I'm sure Sherman or Sean
will jump in to explain this (which seems to be related to a long standing
regression).
-Alan
[1] https://bugs.openjdk.java.net/browse/JDK-8008386
Apologies if this is not the correct place to post this, but i18n
seemed more focused on languages and localization than the mechanics
of transcoding.
I have noticed a behavioral difference in JDK8 decoding a two-byte
Shift-JIS sequence. Specifically, JDK8 appears to report malformed
input for what should be a valid Shift-JIS sequence, where JDK7
reported that the character was unmappable.
The code to reproduce is fairly simple:
byte[] bytes = {(byte)0xEF, 0x40};
CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
System.out.println(decoder.decode(ByteBuffer.wrap(bytes),
CharBuffer.allocate(2), false));
Note that this is pumping the decoder directly and specifying partial
input (false). We use this mechanism in JRuby for transcoding
arbitrary byte[] from one encoding to another.
The result of running this on JDK7 is "UNMAPPABLE[2]" while the result
on JDK8 is "MALFORMED[1]".
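For anyone who wants to run this on their own JDK, a self-contained version of the snippet above might look like the following. The class and method names (`SjisRepro`, `decodeOnce`) are placeholders; the decoder calls are the same as in the snippet.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical wrapper class for illustration only.
class SjisRepro {
    // Runs a single decode pass with endOfInput=false and returns
    // the raw CoderResult so its kind and length can be inspected.
    static CoderResult decodeOnce(byte[] bytes) {
        CharsetDecoder decoder = Charset.forName("Shift-JIS").newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(2);
        return decoder.decode(in, out, false);
    }

    public static void main(String[] args) {
        CoderResult result = decodeOnce(new byte[] {(byte) 0xEF, 0x40});
        // Print the runtime version next to the result for comparison.
        System.out.println(System.getProperty("java.version") + ": " + result);
        if (result.isError()) {
            System.out.println("malformed=" + result.isMalformed()
                    + " unmappable=" + result.isUnmappable()
                    + " length=" + result.length());
        }
    }
}
```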
Information online is spotty as to whether this sequence is valid. It
does appear on the table for
[JIS X 0213](http://x0213.org/codetable/sjis-0213-2004-std.txt) and
several articles on Shift-JIS claim that it is at worst undefined and at
best valid. So I'm leaning toward this being a bug in JDK8's Shift-JIS
decoder.
Note that on JDK7 it is "unmappable", which may mean this code
represents a character with no equivalent in Unicode.
I have uploaded my code to github here:
https://github.com/headius/jdk8_utf8_decoding_bug
Thoughts?
- Charlie