On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
> Ersin Er wrote:
>
>> On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
>>
>>> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>>>
>>> You will have to be a little bit more explicit... How do you build
>>> your RDN? FYI, it is supposed to be a UTF-8 encoded String, so if you
>>> are to code an ä, you will have to:
>>> - create a byte array containing its counterpart (0xC3 0xA4) and do a
>>>   new String( byteArray, "UTF-8" ) before passing it to the RDN
>>>   constructor
>>> - OR do a new RDN( "\u00e4" );
>>>
>>> never do a new RDN( "ä" ), because then the String will be considered
>>> as an ISO-8859-1 encoded string (at least in Germany or in France, not
>>> in Turkey :)
>>
>> What is the difference between creating an RDN with new RDN( "ä" ) and
>> with new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) ?
>
> There is a _big_ difference, because your java file might have been saved
> using an ISO-8859-1 encoding. With new RDN( "ä" ), the "ä" sits in the
> source file in whatever encoding your computer used to store that file.
> There is no guarantee at all that it will be correct when the file is
> compiled and the string is transformed to UTF-8 bytes on another computer,
> using a different encoding.
>
> Using new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells
> the JVM that the bytes are UTF-8 encoded (and UTF-8 = Unicode encoded
> using bytes), so it can translate them to a UTF-16 String correctly.
>
> Of course, using \u00e4 should be the preferred way if you are to use
> internal Strings like "This is an umlaut : \u00e4" in your java file.
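For reference, the two recommended ways of building the string can be compared in a minimal sketch. It stops short of the RDN constructor, since only the String matters here. The class and method names are made up for illustration, and it uses StandardCharsets (Java 7 and later) instead of the "UTF-8" charset name only to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

class UmlautStringDemo {
    // Option 1: decode the UTF-8 bytes of U+00E4 explicitly.
    static String fromUtf8Bytes() {
        // 0xC3 and 0xA4 do not fit in a signed Java byte, hence the casts.
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA4 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Option 2: a Unicode escape, which does not depend on the
    // encoding of the source file.
    static String fromEscape() {
        return "\u00e4";
    }

    public static void main(String[] args) {
        String a = fromUtf8Bytes();
        String b = fromEscape();
        System.out.println(a.equals(b));                        // true
        System.out.println(a.length());                         // 1
        System.out.println(Integer.toHexString(a.charAt(0)));   // e4
    }
}
```

Both options give the same one-char String, and neither depends on how the source file was saved, because the source itself stays pure ASCII.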
If your source code file contains "special characters" encoded in X
encoding, and if you compile that code with javac using the encoding X
(-encoding X), then there can be no problem. The so-called special
character is safely translated to Java's internal encoding. There is no
UTF-8 related stuff here. The X can be UTF-8 or not, that's all.

You can create your source code with ISO-8859-1, and safely compile it
without the encoding option while your platform encoding is ISO-8859-1. The
special characters will be converted to safe Java UTF-16 forms. But if you
send it to me, and if my platform encoding is ISO-8859-9 (Turkish), and if I
compile it with just javac (no encoding option), strings containing
characters that the two encodings map differently will come out wrong (but
the code will still compile). If I give the option -encoding ISO-8859-1 to
the compiler, there will be no problem. There is still no problem related to
UTF-8 here.

A mini reference: http://www.jorendorff.com/articles/unicode/java.html
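To make the platform-encoding scenario concrete, here is a minimal sketch. It does not invoke javac; it assumes that decoding the file's bytes with new String( bytes, charset ) is a fair stand-in for what the compiler does with -encoding. The names SourceEncodingDemo, saveAs and compileAs are made up for illustration, and the ISO-8859-9 charset is assumed to be available in the JVM (it is not one of the charsets every JVM is required to support).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SourceEncodingDemo {
    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;
    // Assumption: this charset is installed; forName throws otherwise.
    static final Charset LATIN5 = Charset.forName("ISO-8859-9");

    // What the editor writes to disk for a string literal.
    static byte[] saveAs(String literal, Charset editorEncoding) {
        return literal.getBytes(editorEncoding);
    }

    // What the compiler sees when it reads those bytes back.
    static String compileAs(byte[] sourceBytes, Charset javacEncoding) {
        return new String(sourceBytes, javacEncoding);
    }

    public static void main(String[] args) {
        String thorn = "\u00fe";              // small thorn, byte 0xFE in ISO-8859-1
        byte[] onDisk = saveAs(thorn, LATIN1);

        // Same encoding on both sides: the literal survives.
        System.out.println(compileAs(onDisk, LATIN1).equals(thorn));     // true

        // Compiled on an ISO-8859-9 platform: 0xFE is read as U+015F
        // (s with cedilla), so the program holds a different string.
        System.out.println(compileAs(onDisk, LATIN5).equals("\u015f"));  // true

        // 0xE4 (a with diaeresis) is the same in both encodings.
        byte[] umlaut = saveAs("\u00e4", LATIN1);
        System.out.println(compileAs(umlaut, LATIN5).equals("\u00e4"));  // true
    }
}
```

Note that ISO-8859-1 and ISO-8859-9 disagree on only six byte values, so ä itself happens to survive this particular mismatch, while characters such as þ do not.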
>> There is nothing as "UTF-8" String in Java.
>
> When you write new String( <some bytes>, "UTF-8" ), you just tell the JVM
> that the byte array is supposed to be a UTF-8 encoded String. It will
> translate those bytes to UTF-16 chars, using one or two chars per
> character as needed (Unicode code points go up to U+10FFFF, which is more
> than a single 16-bit char can hold). For instance, the é in my name has a
> value of 0xE9 in Unicode, and is encoded 0xC3, 0xA9 in UTF-8. If you don't
> tell String() that the byte array is UTF-8 encoded, then it will just
> consider that the byte array is using the default platform encoding. And
> if it's ISO-8859-1, 0xC3 = 'Ã', and 0xA9 = '©', so you now have a Java
> String which is 2 chars long instead of one char long...
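The é example can be checked with a short sketch. The class and method names are hypothetical, and ISO-8859-1 is passed explicitly to play the role of the default platform encoding, so that the result does not depend on the machine running it. The last method adds a character outside the 16-bit range (U+1D11E, chosen only as an illustration) to show the "two chars" case.

```java
import java.nio.charset.StandardCharsets;

class EAcuteDecodingDemo {
    // UTF-8 encoding of U+00E9 (e with acute accent).
    static final byte[] E_ACUTE_UTF8 = { (byte) 0xC3, (byte) 0xA9 };

    // Decoder told the truth: one char, U+00E9.
    static String decodedAsUtf8() {
        return new String(E_ACUTE_UTF8, StandardCharsets.UTF_8);
    }

    // Decoder assumes ISO-8859-1: each byte becomes its own char.
    static String decodedAsLatin1() {
        return new String(E_ACUTE_UTF8, StandardCharsets.ISO_8859_1);
    }

    // U+1D11E does not fit in one 16-bit char, so UTF-16 uses two.
    static String beyondSixteenBits() {
        byte[] utf8 = { (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decodedAsUtf8().length());     // 1
        System.out.println(decodedAsLatin1().length());   // 2
        String clef = beyondSixteenBits();
        System.out.println(clef.length());                            // 2
        System.out.println(clef.codePointCount(0, clef.length()));    // 1
    }
}
```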
>> All strings are UTF-16. You can get their representations in other
>> encodings as byte arrays. So when you do a new RDN( "ä" ), it should be
>> converted to UTF-16 internally. What am I missing here?
>
> It is transformed to UTF-16 according to the encoding used on your
> platform. But then, if your local encoding is ISO-8859-1, when doing a
> String.getBytes( "UTF-8" ), you might get something very different from
> what you were expecting.
>
> Ok, this is not simple. A simple rule then : *always use \uxxxx when
> encoding non ASCII characters in a java file*
>
>> (Not being able to display the character in source code in other
>> platforms is a different matter. It's about the text editor encoding.)
>
> yes, but you always use an editor to write your java file...
>
> At this point, I may also miss something, but I would then like to have
> more information, like a test case which exposes the problem.
>
> Emmanuel.
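One way to see what "something very different" can look like is the sketch below. It assumes one specific scenario that is not spelled out in the thread: the ä was stored as UTF-8 bytes but read back as ISO-8859-1 before getBytes( "UTF-8" ) is called. The class and helper names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class GetBytesDemo {
    // Render bytes as space-separated upper-case hex, e.g. "C3 A4".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Assumed scenario: UTF-8 bytes of the umlaut read as ISO-8859-1.
    static String misreadUmlaut() {
        byte[] utf8 = "\u00e4".getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00e4"));         // C3 A4
        System.out.println(utf8Hex(misreadUmlaut()));  // C3 83 C2 A4
    }
}
```

The string built from the escape encodes to the two bytes an LDAP server would expect, while the misread string encodes to four bytes that represent two unrelated characters.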
-- Ersin
