On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
> >>>>Are you thinking of Java's _modified_ version of UTF-8
> >>>>(http://en.wikipedia.org/wiki/UTF-8#Java)?
> The first sentence from the above wiki says:
>
> "In normal usage, the Java programming language
> <http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports
> standard UTF-8 when reading and writing strings through
> |InputStreamReader
> <http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>|
> and |OutputStreamWriter
> <http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"|
>
> and this is what i do to access sockets, so no problems here.
>
> But then it states that 'Supplementary multilingual plane' is encoded
> incompatible.

Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus, illegal UTF-8
(each character outside the BMP is written as a pair of separately
encoded surrogates). I don't know of any easy workaround except to
flame Sun to hell for being so stupid.

> So must i assume if i send 'mathematical alphanumeric symbols'
> http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
> like 'ℝ' from C to java they will be corrupted?

ℝ is in the BMP, so there's no problem with it. It's just the huge
pages of random letters in every single font/style imaginable that are
outside the BMP. Of course, various important CJK characters (needed
for writing certain names) and historical scripts are also outside the
BMP.

> Both applications work with what they think is 'UTF-8' ...

Yes. And Java is wrong.

However, according to the Wikipedia article referenced, Java _does_ do
the right thing in input and output streams. It's only the object
serialization stuff that uses the bogus UTF-8. So I don't think you're
likely to have problems in practice as long as you don't try to pass
the serialized data off (which would be in binary files anyway, I
think...?) as UTF-8.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
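
To make the difference above concrete, here is a minimal sketch that
encodes the same string both ways and dumps the bytes. It assumes a
modern JDK (StandardCharsets needs Java 7 or later); the class name
Utf8Compare and its helper methods are made up for illustration.
String.getBytes with the UTF-8 charset stands in for the stream
reader/writer path, and DataOutputStream.writeUTF stands in for the
serialization-style "modified UTF-8" path.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Compare {
    // Standard UTF-8, the same encoding an OutputStreamWriter
    // constructed with the UTF-8 charset would emit.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8 as written by DataOutputStream.writeUTF.
    // writeUTF prepends a 2-byte length, which is stripped here so that
    // only the encoded characters are compared.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated uppercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+211D (the double-struck R from the thread) is in the BMP.
        String bmp = "\u211D";
        // U+1D538 (double-struck capital A) is outside the BMP.
        String supp = new String(Character.toChars(0x1D538));

        for (String s : new String[] { bmp, supp }) {
            System.out.printf("U+%04X standard: %s%n",
                    s.codePointAt(0), hex(standardUtf8(s)));
            System.out.printf("U+%04X modified: %s%n",
                    s.codePointAt(0), hex(modifiedUtf8(s)));
        }
    }
}
```

For the BMP character the two encodings should agree. For the
character outside the BMP, the standard encoding is a single 4-byte
sequence, while writeUTF emits two 3-byte sequences, one per surrogate,
which a strict UTF-8 decoder on the C side will reject.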
