Re: Maybe codec bug in MS1252, i.e., encoding Cp1252

Xueming Shen Thu, 01 Sep 2011 13:11:16 -0700

Hi,

These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" type mapping table, which tries to provide a mapping between
the local encoding and the Unicode character set even for those
characters that do not exist in the local encoding. Personally I don't think
that's a good idea in most usage scenarios. All other official (from Microsoft)
or unofficial mapping tables clearly mark these code points as "undefined"
or "unused", for example:


http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
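
A minimal sketch of how one might check which of these byte values the JDK's
Cp1252 decoder treats as defined. The class and method names here are
illustrative only (not part of the JDK), and the charset is looked up under
its "windows-1252" name. With both error actions set to REPORT, the decoder
throws instead of silently substituting U+FFFD, so a byte with no mapping is
detected as an exception:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class Cp1252Undefined {
    // Hypothetical helper: true if the single byte b has a mapping in windows-1252.
    static boolean isDefined(int b) {
        try {
            Charset.forName("windows-1252").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(new byte[] { (byte) b }));
            return true;
        } catch (CharacterCodingException e) {
            // The decoder reported the byte instead of replacing it.
            return false;
        }
    }

    public static void main(String[] args) {
        // 0x83 is included for contrast; the other five are the disputed ones.
        int[] candidates = { 0x81, 0x83, 0x8d, 0x8f, 0x90, 0x9d };
        for (int b : candidates) {
            System.out.printf("0x%02x defined: %b%n", b, isDefined(b));
        }
    }
}
```

On a stock JDK I would expect only 0x83 to be reported as defined.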

btw, the code below is incorrect, or at least it does not work the way you
might expect.


String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");

new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
bytes back to UTF-16 (which the String object uses) using the Cp1252 charset.

The same goes for the second statement.

What were you trying to achieve? Decoding/encoding between UTF-8 bytes and
Cp1252 bytes? It's not going to be a round-trip conversion for those
non-ASCII characters.
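
To illustrate the round-trip problem, here is a rough sketch (assuming Java 7
or later for StandardCharsets; the class and method names are made up for
this example). It takes the UTF-8 bytes of a string, decodes them with
another charset, encodes the result back with that same charset, and
compares the bytes. The UTF-8 form of "兆源" is E5 85 86 E6 BA 90, and the
trailing 0x90 is one of the undefined Cp1252 code points, so it is replaced
on decoding and cannot be recovered. ISO-8859-1 is included only as a
contrast, since it maps every byte value to a character:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripCheck {
    // Hypothetical helper: true if the UTF-8 bytes of text survive being
    // decoded and then re-encoded with the charset 'via'.
    static boolean survives(String text, Charset via) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, via);   // bytes read as 'via'
        byte[] back = misdecoded.getBytes(via);      // and written out again
        return Arrays.equals(utf8, back);
    }

    public static void main(String[] args) {
        // "\u5146\u6e90" is the string from the original mail, written with
        // escapes so the source file encoding does not matter.
        String name = "\u5146\u6e90";
        Charset cp1252 = Charset.forName("windows-1252");
        System.out.println("via Cp1252:     " + survives(name, cp1252));
        System.out.println("via ISO-8859-1: " + survives(name, StandardCharsets.ISO_8859_1));
    }
}
```

If the goal is simply to get the text back from UTF-8 bytes, decoding them
with UTF-8 directly avoids the intermediate charset altogether.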


-Sherman


On 09/01/2011 12:12 PM, Eric Liang wrote:

Hi all,
I recently ran into an encoding error while using Cp1252 with UTF-8: a string converted from UTF-8 to Cp1252 cannot be converted back:
    String name1 = new String( new String("兆源").getBytes("UTF-8"),
    "Cp1252");
    String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
It looks like there are some incorrect codes in the JDK's Cp1252 encoding, and the related codes are:
    0x83    0x0192    ;Latin Small Letter F With Hook
    0x8d    0x008d
    0x8f    0x008f
    0x90    0x0090
    0x9d    0x009d

    ( from the Cp1252->UTF-8 map in
    
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
    )
After I cloned the repository at http://hg.openjdk.java.net/jdk6/jdk6 and fixed these codes in MS1252.java, the encoding error went away.
I guess this is the right place to discuss this problem, and the patch is in the attachment. Any comments are appreciated.
Regards,
Eric
--
-----BEGIN GEEK CODE BLOCK-----
Version: 3.1
GCM/CS/E/MU/P d+(-) s: a- C++ UL$ P+>++ L++ E++ W++ N+ o+>++ K+++ w !O
M-(+) V-- PS+ PE+ Y+ PGP++ t? 5? X? R+>* tv@ b++++ DI-- D G++ e++>+++@ h*
r !y+
------END GEEK CODE BLOCK------
