Re: UTF-8 woes

Ersin Er Fri, 29 Dec 2006 12:09:10 -0800

On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:

Ersin Er wrote:

> On 12/29/06, Emmanuel Lecharny <[EMAIL PROTECTED]> wrote:
>
>> AAAAAAHHHHHhhhh ! (Or is it \C3\C3\C3\C3\C3\C3HHHHHhhhh ? :)
>>
>> You will have to be a little bit more explicit... How do you build
>> your RDN?
>> FYI, it is supposed to be a UTF-8 encoded String, so if you want to
>> encode an ä, you will have to:
>> - create a byte array containing its counterpart (0xC3 0xA4) and do a
>>   new String( byteArray, "UTF-8" ) before passing it to the RDN
>>   constructor
>> - OR do a new RDN( "\u00e4" );
>>
>> Never do a new RDN( "ä" ), because then the String will be considered
>> an ISO-8859-1 encoded string (at least in Germany or in France, not in
>> Turkey :)
>
>
> What is the difference between creating an RDN with new RDN( "ä" ) and
> with new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) ?

There is a _big_ difference, because your Java file might have been
saved using ISO-8859-1. With new RDN( "ä" ), the "ä" is stored in the
file using whatever encoding your computer uses by default. There is no
guarantee at all that it will be correct when the file is compiled on
another computer that uses a different encoding and the string is then
transformed to UTF-8 bytes. Using
new String( new byte[] { (byte) 0xC3, (byte) 0xA4 }, "UTF-8" ) tells the
JVM that the bytes are UTF-8 encoded (UTF-8 being a byte encoding of
Unicode), so it can translate them to UTF-16 correctly. Of course,
\u00e4 should be the preferred way if you use internal Strings like
"This is an umlaut : \u00e4" in your Java file.
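
The two safe forms described above can be compared directly. This is only a minimal sketch: the class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets (with an older JDK, pass the charset name "UTF-8" as in the thread).

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library discussed in this thread.
final class UmlautLiteral {
    // Decode the two UTF-8 bytes of a-umlaut explicitly. The casts are
    // required because 0xC3 and 0xA4 do not fit in a signed Java byte.
    static String fromUtf8Bytes() {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xA4 };
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Unicode escape: the source stays pure ASCII, so the result does not
    // depend on the encoding the compiler assumes for the file.
    static String fromEscape() {
        return "\u00e4";
    }

    public static void main(String[] args) {
        System.out.println(fromUtf8Bytes().equals(fromEscape())); // true
        System.out.println(fromEscape().length());                // 1
    }
}
```

Both forms give the same one-char String on any machine, which is the property that a literal "ä" typed into the source file cannot promise.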


If your source code file contains "special characters" encoded in X
encoding, and if you compile that code with javac using the encoding X
(-encoding X), then there can be no problem. The so-called special
character is safely translated to Java's internal encoding. There is no
UTF-8 related stuff here. X can be UTF-8 or not, that's all.

You can create your source code with ISO-8859-1, and safely compile it
without the encoding option while your platform encoding is
ISO-8859-1. The special characters will be converted to safe Java
UTF-16 forms. But if you send it to me, and if my platform encoding is
ISO-8859-9 (Turkish), and if I compile it with just javac (no encoding
option), the strings will be malformed (but will still compile). If I
give the option -encoding ISO-8859-1 to the compiler, there will be no
problem. There is still no problem related to UTF-8 here.
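
The ISO-8859-1 versus ISO-8859-9 case can be imitated at runtime by decoding the same bytes with both charsets, which is roughly what javac does with the source file. A rough sketch with made-up names; it assumes the JDK ships the ISO-8859-9 charset (mainstream JDKs do, but only a few charsets are guaranteed by the platform). Note that ä (0xE4) is the same in both encodings; only six byte values differ, 0xFD being one of them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: imitates a compiler reading source bytes.
final class SourceEncodingMismatch {
    // Decodes raw source-file bytes the way a compiler would under the
    // given source encoding.
    static String decodeAs(byte[] sourceBytes, Charset encoding) {
        return new String(sourceBytes, encoding);
    }

    public static void main(String[] args) {
        // a-umlaut and y-acute as an editor would save them in ISO-8859-1.
        byte[] saved = { (byte) 0xE4, (byte) 0xFD };
        // Assumption: this charset is available in the running JDK.
        Charset latin5 = Charset.forName("ISO-8859-9");

        String asLatin1 = decodeAs(saved, StandardCharsets.ISO_8859_1);
        String asLatin5 = decodeAs(saved, latin5);

        // Same bytes, same length, but the second char differs.
        System.out.printf("U+%04X U+%04X%n",
                (int) asLatin1.charAt(0), (int) asLatin1.charAt(1)); // U+00E4 U+00FD
        System.out.printf("U+%04X U+%04X%n",
                (int) asLatin5.charAt(0), (int) asLatin5.charAt(1)); // U+00E4 U+0131
    }
}
```

Under ISO-8859-9 the ý silently becomes a dotless ı: the code still compiles, but the string is no longer the one the author typed.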

A mini reference: http://www.jorendorff.com/articles/unicode/java.html

> There is
> no such thing as a "UTF-8" String in Java.

When you write new String( <some bytes>, "UTF-8" ), you just tell the
JVM that the byte array is supposed to be a UTF-8 encoded String. It
will translate those bytes to UTF-16 chars, using one or two chars per
character as needed (Unicode code points go up to U+10FFFF, and those
above U+FFFF take two chars). For instance, the é in my name has a value
of 0xE9 in Unicode, and is encoded as 0xC3, 0xA9 in UTF-8. If you don't
tell String() that the byte array is UTF-8 encoded, then it will just
assume that the byte array uses the default platform encoding. And if
that is ISO-8859-1, 0xC3 = 'Ã' and 0xA9 = '©', so you now have a Java
String which is 2 chars long instead of one char long...
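
The length difference is easy to check. A minimal sketch, with illustrative names and assuming Java 7 or later for StandardCharsets; the four-byte sequence for U+1D11E is added here only to show the "two chars" case and does not come from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: same bytes, different declared charset.
final class DecodeCharset {
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] eAcute = { (byte) 0xC3, (byte) 0xA9 }; // UTF-8 for U+00E9

        // Declared as UTF-8: one char, U+00E9.
        System.out.println(decode(eAcute, StandardCharsets.UTF_8).length());      // 1
        // Declared as ISO-8859-1: two chars, U+00C3 and U+00A9.
        System.out.println(decode(eAcute, StandardCharsets.ISO_8859_1).length()); // 2

        // A code point above U+FFFF needs a surrogate pair, i.e. two chars.
        byte[] clef = { (byte) 0xF0, (byte) 0x9D, (byte) 0x84, (byte) 0x9E }; // U+1D11E
        String s = decode(clef, StandardCharsets.UTF_8);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```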

> All strings are UTF-16. You can get
> their representations in other encodings as byte arrays. So when you
> do a new RDN( "ä" ), it should be converted to UTF-16 internally. What
> am I missing here?

It is transformed to UTF-16 according to the encoding used on your
platform. But then, if your local encoding is ISO-8859-1, when doing a
String.getBytes( "UTF-8" ), you might get something very different from
what you were expecting.
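
One way this can go wrong, sketched under an assumed scenario that the thread does not spell out: the source file was saved as UTF-8 but read as ISO-8859-1, so the literal is already wrong before getBytes is ever called. Names are illustrative only.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows the UTF-8 bytes of a correct and of a
// mis-decoded string.
final class Utf8RoundTrip {
    // Formats bytes as space-separated upper-case hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Simulates a UTF-8 encoded literal being read as ISO-8859-1.
    static String misread(String intended) {
        return new String(intended.getBytes(StandardCharsets.UTF_8),
                          StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String correct = "\u00e4";
        String wrong = misread(correct);

        System.out.println(hex(correct.getBytes(StandardCharsets.UTF_8))); // C3 A4
        System.out.println(hex(wrong.getBytes(StandardCharsets.UTF_8)));   // C3 83 C2 A4
    }
}
```

getBytes itself behaves correctly in both cases; it faithfully encodes whatever chars the String holds, so a decoding mistake made earlier shows up as four bytes on the wire instead of two.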

Ok, this is not simple. A simple rule then :
*always use \uxxxx when encoding non ASCII characters in a java file*

>
> (Not being able to display the character in source code in other
> platforms is a different matter. It's about the text editor encoding.)

yes, but you always use an editor to write your java file...

At this point, I may also be missing something, but I would then like to
have more information, such as a test case which exposes the problem.

Emmanuel.



--
Ersin
