Ersin Er wrote:

On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:

AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)

You will have to be a little bit more explicit... How do you build your RDN? FYI, it is supposed to be a UTF-8 encoded String, so if you want to encode an ä, you will have to:
- create a byte array containing its UTF-8 counterpart (0xC3 0xA4) and do a new
String( byteArray, "UTF-8" ) before passing it to the RDN constructor
- OR do a new RDN( "\u00e4" );

Never do a new RDN( "ä" ), because then the String will be treated as an
ISO-8859-1 encoded string (at least in Germany or in France, not in Turkey
:)


What is the difference between creating an RDN with new RDN( "ä" ) and
with new String( new byte[] { 0xC3, 0xa4 }, "UTF-8" ) ?

There is a _big_ difference, because your Java file might have been saved using an ISO-8859-1 encoding. With new RDN( "ä" ), the compiler simply reads the "ä" stored in that file using your computer's default encoding. There is no guarantee at all that it will be correct when you transform the string to UTF-8 bytes on another computer which uses a different encoding.

Using new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells the JVM that the bytes are UTF-8 encoded (and UTF-8 is a byte-oriented encoding of Unicode), so it can translate them correctly to the UTF-16 chars of the String. Of course, using \u00e4 should be the preferred way if you are to use internal Strings like "This is an umlaut : \u00e4" in your Java file.
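To make it concrete, here is a minimal sketch of the two safe options (the class name is just for illustration; it builds the String you would pass to the RDN constructor, not the RDN itself):

import java.io.UnsupportedEncodingException;

public class UmlautStrings {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // Option 1: build the String from its UTF-8 bytes.
        // Note the casts: 0xC3 and 0xA4 do not fit into a signed byte literal.
        String fromBytes = new String(new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8");

        // Option 2: the Unicode escape is resolved by the compiler itself,
        // independently of the encoding the .java file was saved with.
        String fromEscape = "\u00e4";

        System.out.println(fromBytes.equals(fromEscape)); // prints: true

        // A raw "ä" literal is only guaranteed to equal the two Strings above
        // when the compiler reads the source file with the same encoding the
        // editor used to save it.
    }
}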

There is no such thing as a "UTF-8" String in Java.

When you write new String( <some bytes>, "UTF-8" ), you just tell the JVM that the byte array is supposed to be a UTF-8 encoded String. It will translate those bytes to UTF-16 chars, using one or two chars if needed (a Unicode code point can be larger than what fits in a single 16-bit char, in which case UTF-16 uses a surrogate pair).

For instance, the é in my name has the value 0xE9 in Unicode, and is encoded as 0xC3, 0xA9 in UTF-8. If you don't tell String() that the byte array is UTF-8 encoded, then it will just assume that the byte array uses the default platform encoding. And if that is ISO-8859-1, 0xC3 = 'Ã' and 0xA9 = '©', so you now have a Java String which is 2 chars long instead of 1...
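A small test makes this visible (the class name is invented for this mail; it just decodes the same two bytes with two different charsets):

import java.io.UnsupportedEncodingException;

public class EacuteDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // UTF-8 encoding of 'é' (U+00E9)
        byte[] utf8Bytes = new byte[] { (byte) 0xC3, (byte) 0xA9 };

        String decodedAsUtf8 = new String(utf8Bytes, "UTF-8");        // "é"
        String decodedAsLatin1 = new String(utf8Bytes, "ISO-8859-1"); // "Ã©"

        System.out.println(decodedAsUtf8.length());   // prints: 1
        System.out.println(decodedAsLatin1.length()); // prints: 2
    }
}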

All strings are UTF-16. You can get
their representations in other encodings as byte arrays. So when you
do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
am I missing here?

It is transformed to UTF-16 according to the encoding used on your platform. But if the file was saved with a different encoding than the one the compiler assumes (say, ISO-8859-1), the resulting String is already wrong, and a String.getBytes( "UTF-8" ) will then give you something very different from what you were expecting.

Ok, this is not simple. A simple rule then:
*always use \uxxxx when encoding non-ASCII characters in a Java file*
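Here is a small sketch of why the rule works (again, the class name is only for this mail): the \uxxxx escape is resolved by the compiler before any charset comes into play, so the bytes you get are stable whatever the file encoding.

import java.io.UnsupportedEncodingException;

public class EscapeRule {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // "\u00e4" is turned into the char U+00E4 by the compiler itself,
        // so its UTF-8 form is always the same two bytes.
        byte[] bytes = "\u00e4".getBytes("UTF-8");
        for (byte b : bytes) {
            System.out.printf("%02X ", b); // prints: C3 A4
        }
        System.out.println();
    }
}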


(Not being able to display the character in source code on other
platforms is a different matter. It's about the text editor encoding.)

Yes, but you always use an editor to write your Java file...

At this point, I may also be missing something, but I would then like to have more information, like a test case which exposes the problem.

Emmanuel.
