Re: Maybe codec bug in MS1252, i.e., encoding Cp1252

Xueming Shen Fri, 02 Sep 2011 12:53:18 -0700

On 09/02/2011 02:14 AM, Eric Liang wrote:

On 09/02/2011 04:04 AM, Xueming Shen wrote:
Hi,
These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" type mapping table, which tries to provide a mapping
between the local encoding and the Unicode character set even for
characters that do not exist in the local encoding. Personally I don't think
that is a good idea in most usage scenarios. All other official (from Microsoft)
or unofficial mapping tables clearly mark these code points "undefined"
or "unused", for example:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
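
A minimal sketch of how the "undefined" positions show up in Java, under some assumptions: the message above names only 0x81, so the other four bytes in the array (0x8D, 0x8F, 0x90, 0x9D) are taken from the positions the unicode.org CP1252.TXT table leaves undefined, and the class and method names are made up for illustration. It decodes each byte with the JDK's windows-1252 charset; an unmappable byte comes back as the replacement character U+FFFD, while 0x83 decodes to U+0192.

```java
import java.nio.charset.Charset;

class Cp1252Undefined {
    // "windows-1252" is the canonical name of the charset also known as Cp1252.
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Decode one byte with Cp1252 and return the resulting char.
    // new String(byte[], Charset) substitutes U+FFFD for unmappable input.
    static char decode(int b) {
        return new String(new byte[] { (byte) b }, CP1252).charAt(0);
    }

    public static void main(String[] args) {
        // Assumed list of undefined positions, plus 0x83 as a defined control case.
        int[] bytes = { 0x81, 0x8D, 0x8F, 0x90, 0x9D, 0x83 };
        for (int b : bytes) {
            System.out.printf("0x%02X -> U+%04X%n", b, (int) decode(b));
        }
    }
}
```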
btw, the code below is incorrect, or at least it does not work the way you might expect.
String name1 = new String( new String("兆源").getBytes("UTF-8"),"Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters from
UTF-16 to UTF-8 bytes. It does not make sense to then decode these UTF-8
bytes back to UTF-16 (which the String object uses) by using the Cp1252 charset.
The same goes for the second attempt.

What are you trying to achieve? Decoding/encoding between UTF-8 bytes and Cp1252
bytes? It's not going to be a round-trip conversion for those non-ASCII characters.
Thanks Sherman for your explanation.
The problem occurred when I was using JDBC with MySQL. A former application stored UTF-8 data in a database with the default configuration (whose encoding is latin1). Reading the data back and decoding it in PHP works fine, but I failed in Java when reading the data. According to the documentation (http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html), latin1 in MySQL corresponds to Cp1252 in Java, so I found the cause, and I believe the person here ran into the same problem (http://forums.mysql.com/read.php?39,228068,228068#msg-228068).
Data in latin1 (in Java) can be converted to UTF-8 and back freely. According to Wikipedia, Cp1252 is treated as a superset of ISO_8859-1, so I think it is natural to expect the same of Cp1252 as of latin1, though it does not work now.
However, YMMV. Would you mind giving some suggestions on this? Thanks in advance.
Eric

Windows-1252 (cp1252) is a superset of ISO 8859-1. ISO 8859-1 is normally
referred to as latin-1. What we have in the Java charset repository is
ISO-8859-1. The difference between ISO 8859-1 and ISO-8859-1 (without and
with the dash) is the C0 and C1 control character area: ISO-8859-1 has
C0 and C1 defined, ISO 8859-1 does not.

So in your workaround above, you'd better use ISO-8859-1 instead of cp1252.
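
A rough sketch of that workaround, not a tested fix for the JDBC case: it takes the UTF-8 bytes of the two characters from the earlier example, decodes them with a single-byte charset (which is roughly what a misconfigured client does), then reverses both steps. The class and method names are hypothetical. ISO-8859-1 maps every byte value to a char, so the round trip is lossless; with windows-1252 the UTF-8 byte 0x90 in the second character has no mapping, so the original cannot be recovered.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Decode UTF-8 bytes with a single-byte charset, then reverse both steps.
    static String roundTrip(String text, Charset singleByte) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String mangled = new String(utf8, singleByte); // mis-decoded form
        byte[] back = mangled.getBytes(singleByte);    // recover the raw bytes
        return new String(back, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The two characters from the example, written as escapes so the
        // source file's own encoding does not matter.
        String original = "\u5146\u6E90";
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("ISO-8859-1 round trip ok: "
                + original.equals(roundTrip(original, StandardCharsets.ISO_8859_1)));
        System.out.println("Cp1252 round trip ok:     "
                + original.equals(roundTrip(original, cp1252)));
    }
}
```

The same roundTrip call with ISO-8859-1 should work for any input string, since no byte value is left unmapped in that charset.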

I know little about JDBC + MySQL, so I'm probably not the one to give
suggestions on this topic. Simply from reading the description of the problem
you are facing, I guess you'd better set your client-side encoding/charset
correctly to utf-8 or gbk to receive results in Chinese correctly.

-Sherman
