Hi,

These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" mapping table, which tries to provide a mapping between the
local encoding and the Unicode character set even for characters that
do not exist in the local encoding. Personally I don't think that is a
good idea in most usage scenarios. All other mapping tables, official
(from Microsoft) or unofficial, clearly mark these code points as
"undefined" or "unused", for example

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
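
To see which code points the JDK itself treats as undefined, a small sketch like the one below can help. The class name Cp1252Undefined and the helper undefinedBytes are made up for illustration; the sketch assumes the charset is available under the name "windows-1252" and relies on new String(byte[], Charset) substituting U+FFFD for bytes it cannot map. I would expect it to list 0x81, 0x8D, 0x8F, 0x90 and 0x9D.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

public class Cp1252Undefined {
    // Return the byte values in 0x80..0x9F that the decoder cannot map,
    // i.e. that come back as the replacement character U+FFFD.
    static List<Integer> undefinedBytes(Charset cs) {
        List<Integer> result = new ArrayList<>();
        for (int b = 0x80; b <= 0x9F; b++) {
            String decoded = new String(new byte[] { (byte) b }, cs);
            if (decoded.equals("\uFFFD")) {
                result.add(b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int b : undefinedBytes(Charset.forName("windows-1252"))) {
            System.out.printf("0x%02X is undefined%n", b);
        }
    }
}
```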

Btw, the code below is incorrect, or at least it does not work the way you might expect.

String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");

new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
bytes back to UTF-16 (which the String object uses) using the Cp1252 charset.

The same goes for the second attempt.

What are you trying to achieve? Decoding/encoding between UTF-8 bytes and Cp1252
bytes? That is not going to be a round-trip conversion for non-ASCII characters.
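
To make the round-trip point concrete, here is a minimal sketch (the class name Cp1252RoundTrip and its helpers are hypothetical). The UTF-8 encoding of "源" is E6 BA 90, and 0x90 is one of the undefined Cp1252 code points, so decoding loses it and re-encoding cannot restore it. For comparison the sketch also runs the same bytes through ISO-8859-1, which maps every byte value to a character; that comparison is my own addition and only illustrates the difference.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Cp1252RoundTrip {
    // Decode the bytes with the given charset, then re-encode with the same charset.
    static byte[] roundTrip(byte[] input, Charset cs) {
        return new String(input, cs).getBytes(cs);
    }

    // True if every byte survives the decode + encode round trip.
    static boolean survives(byte[] input, Charset cs) {
        return Arrays.equals(input, roundTrip(input, cs));
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // "\u5146\u6e90" is the string from the original mail, written with
        // escapes so the result does not depend on the source file encoding.
        byte[] utf8 = "\u5146\u6e90".getBytes(StandardCharsets.UTF_8);

        System.out.println("Cp1252 keeps bytes:     " + survives(utf8, cp1252));
        System.out.println("ISO-8859-1 keeps bytes: "
                + survives(utf8, StandardCharsets.ISO_8859_1));
    }
}
```

If the bytes really have to pass through a String unchanged, a charset that defines all 256 byte values is the safer carrier; Cp1252 is not one of them.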

-Sherman


On 09/01/2011 12:12 PM, Eric Liang wrote:
Hi all,
I recently ran into an encoding error while using Cp1252 with UTF-8: a string converted from UTF-8 to Cp1252 cannot be converted back:

    String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
    String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");

It looks like there are some incorrect mappings in the JDK's Cp1252 encoding. The related code points are:

    0x83    0x0192    ;Latin Small Letter F With Hook
    0x8d    0x008d
    0x8f    0x008f
    0x90    0x0090
    0x9d    0x009d

    (from the Cp1252->UTF-8 map in
    http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt)

After I cloned the repository at http://hg.openjdk.java.net/jdk6/jdk6 and fixed these mappings in MS1252.java, the encoding error went away.

I guess this is the right place to discuss this problem; the patch is attached. Any comments are appreciated.

Regards,
Eric
