Ersin Er wrote:

On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:

AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)

You will have to be a little bit more explicit... How do you build your RDN? FYI, it is supposed to be a UTF-8 encoded String, so if you want to encode an ä, you will have to:
- create a byte array containing its UTF-8 counterpart (0xC3 0xA4) and do a new
String( byteArray, "UTF-8" ) before passing it to the RDN constructor
- OR do a new RDN( "\u00e4" );

Never do a new RDN( "ä" ), because then the String will be treated as an
ISO-8859-1 encoded string (at least in Germany or in France, not in Turkey
:)


What is the difference between creating an RDN with new RDN( "ä" ) and
with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ?

There is a _big_ difference, because your Java file might have been saved using an ISO-8859-1 encoding. With new RDN( "ä" ), the compiler simply reads the "ä" stored in that file using your computer's default encoding. There is no guarantee at all that it will be correct when you transform the string to UTF-8 bytes on another computer which uses a different encoding.

Using new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells the JVM that the bytes are UTF-8 encoded (and UTF-8 is a byte-oriented encoding of Unicode), so it can translate them correctly to the UTF-16 chars of the String. Of course, using \u00e4 should be the preferred way if you are to use internal Strings like "This is an umlaut : \u00e4" in your Java file.
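To make it concrete, here is a minimal sketch of the two safe options (the class name is just for illustration; it builds the String you would pass to the RDN constructor, not the RDN itself):

import java.io.UnsupportedEncodingException;

public class UmlautStrings {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Option 1: build the String from its UTF-8 bytes.
        // Note the casts: 0xC3 and 0xA4 do not fit into a signed byte literal.
        String fromBytes = new String(new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8");

        // Option 2: the Unicode escape is resolved by the compiler itself,
        // independently of the encoding the .java file was saved with.
        String fromEscape = "\u00e4";

        System.out.println(fromBytes.equals(fromEscape)); // prints: true

        // A raw "ä" literal is only guaranteed to equal the two Strings above
        // when the compiler reads the source file with the same encoding the
        // editor used to save it.
    }
}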

There is no such thing as a "UTF-8" String in Java.

When you write new String( <some bytes>, "UTF-8" ), you just tell the JVM that the byte array is supposed to be a UTF-8 encoded String. It will translate those bytes to UTF-16 chars, using one or two chars if needed (a Unicode code point can be larger than what fits in a single 16-bit char, in which case UTF-16 uses a surrogate pair).

For instance, the é in my name has the value 0xE9 in Unicode, and is encoded as 0xC3, 0xA9 in UTF-8. If you don't tell String() that the byte array is UTF-8 encoded, then it will just assume that the byte array uses the default platform encoding. And if that is ISO-8859-1, 0xC3 = 'Ã' and 0xA9 = '©', so you now have a Java String which is 2 chars long instead of 1...
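A small test makes this visible (the class name is invented for this mail; it just decodes the same two bytes with two different charsets):

import java.io.UnsupportedEncodingException;

public class EacuteDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // UTF-8 encoding of 'é' (U+00E9)
        byte[] utf8Bytes = new byte[] { (byte) 0xC3, (byte) 0xA9 };

        String decodedAsUtf8 = new String(utf8Bytes, "UTF-8");        // "é"
        String decodedAsLatin1 = new String(utf8Bytes, "ISO-8859-1"); // "Ã©"

        System.out.println(decodedAsUtf8.length());   // prints: 1
        System.out.println(decodedAsLatin1.length()); // prints: 2
    }
}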

All strings are UTF-16. You can get
their representations in other encodings as byte arrays. So when you
do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
am I missing here?

It is transformed to UTF-16 according to the encoding used on your platform. But if the file was saved with a different encoding than the one the compiler assumes (say, ISO-8859-1), the resulting String is already wrong, and a String.getBytes( "UTF-8" ) will then give you something very different from what you were expecting.

Ok, this is not simple. A simple rule then:
*always use \uxxxx when encoding non-ASCII characters in a Java file*
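Here is a small sketch of why the rule works (again, the class name is only for this mail): the \uxxxx escape is resolved by the compiler before any charset comes into play, so the bytes you get are stable whatever the file encoding.

import java.io.UnsupportedEncodingException;

public class EscapeRule {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "\u00e4" is turned into the char U+00E4 by the compiler itself,
        // so its UTF-8 form is always the same two bytes.
        byte[] bytes = "\u00e4".getBytes("UTF-8");
        for (byte b : bytes) {
            System.out.printf("%02X ", b); // prints: C3 A4
        }
        System.out.println();
    }
}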


(Not being able to display the character in source code on other
platforms is a different matter. It's about the text editor encoding.)

Yes, but you always use an editor to write your Java file...

At this point, I may also be missing something, but I would then like to have more information, like a test case which exposes the problem.

Emmanuel.
