On 09/02/2011 02:14 AM, Eric Liang wrote:
On 09/02/2011 04:04 AM, Xueming Shen wrote:
Hi,
These 5 code points are "undefined" characters in Cp1252. (The first one
should be 0x81, not 0x83, since 0x83<->U+0192 is defined and works
correctly in the Cp1252 charset.) The mapping table you referred to is a
"bestfit" type mapping table, which tries to provide mappings between
the local encoding and the Unicode character set even for characters
that do not exist in the local encoding. Personally I don't think that
is a good idea in most usage scenarios. All other mapping tables,
official (from Microsoft) or unofficial, clearly mark these code points
"undefined" or "unused", for example
http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
http://en.wikipedia.org/wiki/Windows-1252
http://msdn.microsoft.com/en-us/library/cc195054.aspx
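A minimal sketch for checking this on a given JDK. It assumes the runtime ships a "windows-1252" charset (not guaranteed by the Java specification) and that the five undefined code points are 0x81, 0x8D, 0x8F, 0x90 and 0x9D, as listed in the tables above. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;

public class Cp1252UndefinedBytes {
    // Byte values that the CP1252 mapping tables mark as undefined/unused.
    public static final int[] UNDEFINED = {0x81, 0x8D, 0x8F, 0x90, 0x9D};

    // Decode a single byte with the given charset and return the resulting char.
    // new String(byte[], Charset) substitutes the charset's replacement
    // character for input it cannot map.
    public static char decodeByte(int b, Charset cs) {
        return new String(new byte[] {(byte) b}, cs).charAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this charset name is available on the running JDK.
        Charset cp1252 = Charset.forName("windows-1252");
        for (int b : UNDEFINED) {
            System.out.printf("0x%02X -> U+%04X%n", b, (int) decodeByte(b, cp1252));
        }
        // 0x83 is a defined code point, shown for comparison.
        System.out.printf("0x83 -> U+%04X%n", (int) decodeByte(0x83, cp1252));
    }
}
```

A byte that the charset cannot map should come out as the replacement character U+FFFD, while 0x83 should come out as U+0192.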
Btw, the code below is incorrect, or at least it does not work the way
you might expect.
String name1 = new String( new String("兆源").getBytes("UTF-8"),
"Cp1252");
String name2 = new String( name1.getBytes("Cp1252"), "UTF-8");
new String("兆源").getBytes("UTF-8") encodes your 2 Chinese characters
from UTF-16 to UTF-8 bytes. It does not make sense to then decode these
UTF-8 bytes back to UTF-16 (which the String object uses) with the
Cp1252 charset. The same goes for the second attempt.
What did you try to achieve? Decoding/encoding between UTF-8 bytes and
CP1252 bytes? That is not going to be a round-trip conversion for those
non-ASCII characters.
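To make the loss visible, here is a small sketch that pushes the UTF-8 bytes of the two characters through a Cp1252 decode/encode cycle and compares the bytes. It assumes a "windows-1252" charset is available on the JDK; the class and helper names are hypothetical. The UTF-8 encoding of the second character ends in the byte 0x90, which is one of the undefined Cp1252 code points.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Cp1252RoundTrip {
    // Decode the bytes with cs, then encode the resulting String with cs again.
    public static byte[] roundTrip(byte[] bytes, Charset cs) {
        return new String(bytes, cs).getBytes(cs);
    }

    // Render bytes as hex for easy comparison.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The two characters from the original code, written as escapes
        // so the source file encoding does not matter.
        byte[] utf8 = "\u5146\u6E90".getBytes(StandardCharsets.UTF_8);
        // Assumption: this charset name is available on the running JDK.
        byte[] back = roundTrip(utf8, Charset.forName("windows-1252"));
        System.out.println("before: " + hex(utf8));
        System.out.println("after:  " + hex(back));
        System.out.println("identical: " + Arrays.equals(utf8, back));
    }
}
```

Once a byte has been decoded to the replacement character, encoding cannot restore it, so the final new String(..., "UTF-8") step no longer sees valid UTF-8 for that character.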
Thanks, Sherman, for your explanation.
The problem occurred when I was using JDBC with MySQL. The former
application stored UTF-8 data in a database with the default
configuration (encoding latin1), and reading the data back and decoding
it in PHP works fine. But I failed to read the data in Java. According
to the documentation (
http://dev.mysql.com/doc/refman/5.5/en/connector-j-reference-charsets.html
), latin1 in MySQL corresponds to Cp1252 in Java, which is how I found
the cause. I believe the person here ran into the same problem (
http://forums.mysql.com/read.php?39,228068,228068#msg-228068 ).
Data in latin1 (in Java) can be converted to UTF-8 freely and vice
versa. According to Wikipedia, Cp1252 is treated as a superset of
ISO_8859-1, so I think it is natural to expect the same of Cp1252 as of
latin1, though it does not work now.
However, YMMV. Would you mind giving some suggestions on this? Thanks
in advance.
Eric
Windows-1252 (cp1252) is a superset of ISO 8859-1. ISO 8859-1 is
normally referred to as latin-1. What we have in the Java charset
repository is ISO-8859-1. The difference between ISO 8859-1 and
ISO-8859-1 (without and with the dash) is the C0 and C1 control
character area: ISO-8859-1 has C0 and C1 defined, ISO 8859-1 does not.
So in your workaround above, you had better use ISO-8859-1 instead of
cp1252.
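A sketch of that workaround with ISO-8859-1, which maps every byte value 0x00-0xFF to the code point of the same number and therefore round-trips arbitrary bytes. The helper names are hypothetical, and whether the driver really hands the stored bytes over as ISO-8859-1-decoded text is an assumption that would have to be checked against the actual connection settings.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Simulate the damage: UTF-8 bytes read as if they were ISO-8859-1 text.
    public static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undo it: recover the raw bytes via ISO-8859-1, then decode them as UTF-8.
    public static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u5146\u6E90"; // the two characters from the example
        String garbled = misdecode(original);
        String repaired = repair(garbled);
        System.out.println("garbled length: " + garbled.length());
        System.out.println("repaired equals original: " + repaired.equals(original));
    }
}
```

This only repairs strings whose bytes survived intact; it does not help if some earlier step already replaced unmappable bytes.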
I know little about JDBC + MySQL, so I am probably not the one to give
suggestions on this topic.
Just from reading the description of the problem you are facing, I
guess you had better set your client-side encoding/charset correctly
to utf-8 or gbk to receive the results in Chinese correctly.
-Sherman