Rich Felker wrote:
On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
Are you thinking of Java's _modified_ version of UTF-8
(http://en.wikipedia.org/wiki/UTF-8#Java)?
The first sentence from the above wiki says:
"In normal usage, the Java programming language
<http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports
standard UTF-8 when reading and writing strings through
InputStreamReader
<http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>
and OutputStreamWriter
<http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"
and this is what I do to access sockets, so no problems here.
But then it states that the 'Supplementary multilingual plane' is
encoded incompatibly.
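
A minimal sketch of the stream case, assuming an in-memory byte array stands in for the socket. The class and method names (StreamRoundTrip, encode, decode) are made up for illustration. It pushes one character from outside the BMP (U+1D538) through OutputStreamWriter and InputStreamReader with an explicit UTF-8 charset, so the byte count and the round trip can be checked directly.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class StreamRoundTrip {
    // Encode text through OutputStreamWriter with an explicit UTF-8 charset.
    // The ByteArrayOutputStream is a stand-in for a socket's output stream.
    static byte[] encode(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decode bytes through InputStreamReader with an explicit UTF-8 charset.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D538 lies outside the BMP, so Java stores it as a surrogate pair.
        String supp = new String(Character.toChars(0x1D538));
        byte[] wire = encode(supp);
        System.out.println("bytes on the wire: " + wire.length);
        System.out.println("round trip intact: " + decode(wire).equals(supp));
    }
}
```

Passing the charset explicitly matters here: without it, both classes fall back to the platform default encoding, which may not be UTF-8.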
Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java generates and processes bogus illegal UTF-8
(surrogates). I don't know if there are any easy workarounds except to
flame Sun to hell for being so stupid.
So must I assume that if I send 'mathematical alphanumeric symbols'
http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
like 'ℝ' from C to Java, they will be corrupted?
ℝ is in the BMP, so no problem with it. It's just the huge pages of
random letters in every single font/style imaginable that are outside
the BMP. Of course various important CJK characters (needed for
writing certain names) and historical scripts are also outside the
BMP.
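
The BMP distinction can be checked from Java itself. The following is a small sketch, with hypothetical names (BmpCheck, inBmp, utf16Units), that uses the standard Character methods to test three code points: U+211D (ℝ), U+1D538 (a double-struck A from the Mathematical Alphanumeric Symbols block) and U+20000 (the first CJK Extension B ideograph).

```java
class BmpCheck {
    // True if the code point lies in the Basic Multilingual Plane (U+0000..U+FFFF).
    static boolean inBmp(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    // Number of UTF-16 code units (Java chars) needed: 1 inside the BMP, 2 outside.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int[] samples = {0x211D, 0x1D538, 0x20000};
        for (int cp : samples) {
            System.out.println(String.format("U+%04X", cp)
                    + " inBmp=" + inBmp(cp)
                    + " utf16Units=" + utf16Units(cp));
        }
    }
}
```

Only code points that need two UTF-16 units, that is, a surrogate pair, are affected by the encoding difference discussed in this thread.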
Both applications work with what they think is 'UTF-8' ...
Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams. It's only the object serialization stuff that uses the bogus
UTF-8. So I don't think you're likely to have problems in practice as
long as you don't try to pass this data off (which would be in binary
files anyway, I think...?) as UTF-8.
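
To make the difference concrete, here is a hedged sketch that compares the two encodings byte for byte. It assumes DataOutputStream.writeUTF as a representative of the modified form (it is documented to write modified UTF-8 behind a two-byte length prefix) and String.getBytes with StandardCharsets.UTF_8 as the standard form. The class and helper names are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Standard UTF-8, the same encoding the stream reader/writer classes use.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Modified UTF-8 as written by DataOutputStream.writeUTF.
    // writeUTF emits a two-byte length prefix first; it is stripped here.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = "\u211D";                                 // inside the BMP
        String supp = new String(Character.toChars(0x1D538));  // outside the BMP
        String nul = "\u0000";                                 // the NUL case

        for (String s : new String[] {bmp, supp, nul}) {
            System.out.println("standard: " + hex(standardUtf8(s))
                    + " | modified: " + hex(modifiedUtf8(s)));
        }
    }
}
```

For the BMP character both forms agree. For the supplementary character, the standard form is a single four-byte sequence, while the modified form encodes each surrogate separately as three bytes, which a strict UTF-8 decoder on the C side should reject. NUL likewise becomes a two-byte sequence in the modified form.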
OK, thanks, so porting legacy C/C++ to Unicode UTF-8 is simple :-)
Marcel
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/