On 09/02/2011 02:14 AM, Eric Liang wrote:
On 09/02/2011 04:04 AM, Xueming Shen wrote:
Hi,

These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->u_0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" type mapping table, which tries to provide a mapping between
the local encoding and the Unicode character set even for characters
that do not exist in the local encoding. Personally I don't think that is
a good idea in most use scenarios. All other official (from Microsoft)
or unofficial mapping tables clearly mark these code points "undefined"
or "unused", for example:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
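
To see which bytes a given JDK treats as undefined, something like the following sketch could be used. The class and method names (UndefinedBytes, undefinedBytes) are made up for illustration; it assumes that a strict decoder configured with CodingErrorAction.REPORT throws for every byte the charset does not map.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the JDK.
class UndefinedBytes {
    // Returns every single-byte value that the named charset's decoder rejects.
    static List<Integer> undefinedBytes(String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        List<Integer> undefined = new ArrayList<>();
        for (int b = 0x00; b <= 0xFF; b++) {
            try {
                // decode(ByteBuffer) resets the decoder before each call.
                decoder.decode(ByteBuffer.wrap(new byte[] { (byte) b }));
            } catch (CharacterCodingException e) {
                undefined.add(b);
            }
        }
        return undefined;
    }

    public static void main(String[] args) {
        for (int b : undefinedBytes("Cp1252")) {
            System.out.printf("0x%02X%n", b);
        }
    }
}
```

Running the same method with "ISO-8859-1" should report no undefined bytes, since that charset maps all 256 byte values.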

Btw, the code below is incorrect, or at least it does not work the way you might expect.

String name1 = new String( new String("兆源").getBytes("UTF-8"), "Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");

new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
bytes back to UTF-16 (which the String object uses) using the Cp1252 charset.

The same goes for the second attempt.

What are you trying to achieve? Decoding/encoding between UTF-8 bytes and Cp1252
bytes? It's not going to be a round-trip conversion for non-ASCII characters.
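
A minimal sketch of why the round trip breaks, under the assumption that the JDK's Cp1252 leaves 0x90 undefined: the UTF-8 encoding of U+6E90 is E6 BA 90, so the last byte cannot survive a decode/encode pass through Cp1252. The class and method names (Cp1252RoundTrip, survives) are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the JDK.
class Cp1252RoundTrip {
    // True if the UTF-8 bytes of text come back unchanged after being
    // decoded and re-encoded with the named single-byte charset.
    static boolean survives(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] back = new String(utf8, cs).getBytes(cs);
        return Arrays.equals(utf8, back);
    }

    public static void main(String[] args) {
        String name = "\u5146\u6E90"; // the two characters from the original code
        System.out.println(survives(name, "Cp1252"));
    }
}
```

Plain ASCII text passes through unchanged; only strings whose UTF-8 bytes include one of the undefined Cp1252 values are damaged.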
Thanks Sherman for your explanation.

The problem occurred when I was using JDBC with MySQL. An earlier application stored UTF-8 data in a database with the default configuration (encoding latin1), and reading and decoding that data in PHP works fine. But reading the data failed in Java. According to the documentation ( http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html ), latin1 in MySQL corresponds to Cp1252 in Java, which is how I found the cause, and I believe the person here ran into the same problem ( http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).

Data in latin1 (in Java) can be converted to UTF-8 and back freely. According to Wikipedia, Cp1252 is treated as a superset of ISO_8859-1, so I think it is natural to expect the same behavior from Cp1252 as from latin1, though it does not work now.

However, YMMV. Would you mind giving some suggestions on this? Thanks in advance.

Eric

Windows-1252 (cp1252) is a superset of ISO 8859-1. ISO 8859-1 is normally referred to as latin-1. What we have in the Java charset repository is ISO-8859-1. The difference between ISO 8859-1 and ISO-8859-1 (without the dash and with the dash) is the C0 and C1 control character
area: ISO-8859-1 has C0 and C1 defined, ISO 8859-1 does not.

So in your above workaround, you'd better use ISO-8859-1 instead of cp1252.
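
As a rough sketch of that workaround (names such as Latin1RoundTrip, misdecode and recover are hypothetical): because ISO-8859-1 maps each of the 256 byte values to the char with the same number, the original UTF-8 bytes can be recovered from a String that was decoded with it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK.
class Latin1RoundTrip {
    // Simulates the wrong decoding: UTF-8 bytes read as if they were ISO-8859-1.
    static String misdecode(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    // Undoes it: take the ISO-8859-1 bytes back and decode them as UTF-8.
    static String recover(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "\u5146\u6E90"; // the two characters from the original code
        System.out.println(name.equals(recover(misdecode(name))));
    }
}
```

This only helps if the bytes really were decoded with ISO-8859-1; once they have gone through Cp1252, the undefined bytes are already lost.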

I know little about JDBC + MySQL, so I am probably not the one to give suggestions on this topic. Simply from reading the description of the problem you are facing, I guess you'd better set your client-side encoding/charset correctly to utf-8 or gbk to receive the Chinese results
correctly.

-Sherman
