On Thu, Mar 01, 2007 at 07:53:52PM +0100, Marcel Ruff wrote:
> >>>>Are you thinking of Java's _modified_ version of UTF-8
> >>>>(http://en.wikipedia.org/wiki/UTF-8#Java)?
> The first sentence from the above wiki says:
> 
> "In normal usage, the Java programming language 
> <http://en.wikipedia.org/wiki/Java_%28programming_language%29> supports 
> standard UTF-8 when reading and writing strings through 
> |InputStreamReader 
> <http://java.sun.com/javase/6/docs/api/java/io/InputStreamReader.html>| 
> and |OutputStreamWriter 
> <http://java.sun.com/javase/6/docs/api/java/io/OutputStreamWriter.html>"|
> 
> and this is what i do to access sockets, so no problems here.
> 
> But then it states that 'Supplementary multilingual plane' is encoded 
> incompatible.

Oh, you're talking about that part, not the NUL issue. Then yes, it's
a major problem. Java's modified UTF-8 encodes each character outside
the BMP as a UTF-16 surrogate pair, writing each surrogate as its own
3-byte sequence, and encoded surrogates are illegal in real UTF-8. I
don't know of any easy workaround except to flame Sun to hell for
being so stupid.
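
To make the difference concrete, here is a minimal sketch comparing
the two encodings of one non-BMP character. The class and method
names are made up for illustration; the only real APIs used are
String.getBytes and DataOutputStream.writeUTF, the latter being one
place where the modified form shows up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class ModifiedUtf8Demo {
    // Standard UTF-8, as produced by the charset encoder.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8, as produced by DataOutputStream.writeUTF.
    // writeUTF prepends a 2-byte length field, which is stripped here.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Space-separated uppercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A, outside the BMP.
        String s = new String(Character.toChars(0x1D538));
        System.out.println(hex(standardUtf8(s))); // F0 9D 94 B8
        System.out.println(hex(modifiedUtf8(s))); // ED A0 B5 ED B4 B8
    }
}
```

The second line of output is the surrogate pair D835 DD38 with each
half encoded separately, which a strict UTF-8 decoder must reject.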

> So must i assume if i send 'mathematical alphanumeric symbols'
> http://en.wikipedia.org/wiki/Mathematical_alphanumeric_symbols
> like 'ℝ' from C to java they will be corrupted?

ℝ (U+211D) is in the BMP, so no problem with it. It's just the
Mathematical Alphanumeric Symbols block itself (U+1D400 and up), the
huge pages of letters in every single font/style imaginable, that is
outside the BMP. Of course various important CJK characters (needed
for writing certain names) and historical scripts are also outside
the BMP.
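
If you want to check a given character yourself, something like the
following should do. This is only a sketch: the class name is
invented, and U+20BB7 is my own pick of a CJK Extension B character
that turns up in names. Character.isBmpCodePoint needs Java 7 or
later.

```java
// Hypothetical helper class for checking which plane a character is in.
class BmpCheck {
    // True if the code point lies in the Basic Multilingual Plane.
    static boolean isBmp(int codePoint) {
        return Character.isBmpCodePoint(codePoint);
    }

    // Number of UTF-16 code units (Java chars) needed: 1 or 2.
    static int utf16Units(int codePoint) {
        return Character.charCount(codePoint);
    }

    static String describe(int codePoint) {
        return String.format("U+%04X %s, %d UTF-16 unit(s)",
                codePoint,
                isBmp(codePoint) ? "BMP" : "supplementary",
                utf16Units(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(describe(0x211D));  // DOUBLE-STRUCK CAPITAL R
        System.out.println(describe(0x1D538)); // MATHEMATICAL DOUBLE-STRUCK CAPITAL A
        System.out.println(describe(0x20BB7)); // CJK Extension B ideograph
    }
}
```

Only characters that need two UTF-16 units are affected by the
surrogate problem.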

> Both applications work with what they think is 'UTF-8' ...

Yes. And Java is wrong. However, according to the Wikipedia article
referenced, Java _does_ do the right thing in input and output
streams (InputStreamReader and OutputStreamWriter). It's only the
object serialization stuff that uses the bogus UTF-8. So I don't
think you're likely to have problems in practice as long as you don't
try to pass serialized data off (which would be in binary files
anyway, I think...?) as UTF-8.
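
For the stream case, a quick sanity check along these lines should
confirm it. Again just a sketch with made-up class and method names,
using in-memory byte arrays where your application would have the
socket streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical round-trip check; byte arrays stand in for a socket.
class StreamRoundTrip {
    // Encode a string the way a UTF-8 OutputStreamWriter would send it.
    static byte[] encode(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Decode bytes the way a UTF-8 InputStreamReader would receive them.
    static String decode(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            char[] chunk = new char[256];
            int n;
            while ((n = r.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+1D538, a character outside the BMP.
        String s = new String(Character.toChars(0x1D538));
        byte[] wire = encode(s);
        System.out.println(wire.length);            // 4
        System.out.println(decode(wire).equals(s)); // true
    }
}
```

Four bytes on the wire means a single standard UTF-8 sequence, which
is what a C program on the other end expects.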

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
