Re: The use of UTIUtil.toUsingCharset?

Laura Werner Tue, 04 Feb 2003 13:57:35 -0800

Hi Sung-Gu,

Actually, that's very easy...
And not that important unless it's not going to support multilingual use.

As you can see in the diagram, the byte information created from the original charset
should be restored. That's all.

My understanding of what you're saying is that if someone constructs a URI using escaped characters in a particular charset (e.g. Big-5), using the URI(char[] escaped) constructor, then URI needs to preserve those characters. If someone asks for the URI back as an escaped string in the original charset (e.g. Big-5 again), we need to give them the *exact* original string; it's not good enough to transcode from the escaped Big-5 string to Unicode and back to Big-5. Is this correct?
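To make that reading concrete, here is a minimal sketch of the behaviour as I understand it. It is not the actual URI class: PreservedEscapedUri, toEscaped, escape and unescape are all hypothetical names made up for illustration, and the escaping rule (percent-escape everything except ASCII letters, digits and "-_.~") is a simplifying assumption. The only point is that the original escaped form is stored and handed back verbatim when the same charset is requested.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical holder for illustration only; not the real URI class. */
final class PreservedEscapedUri {
    private final String originalEscaped;  // exact escaped form as supplied
    private final Charset originalCharset; // charset the escapes were produced in

    PreservedEscapedUri(char[] escaped, Charset charset) {
        this.originalEscaped = new String(escaped);
        this.originalCharset = charset;
    }

    /** Returns the original string untouched when the charset matches. */
    String toEscaped(Charset requested) {
        if (requested.equals(originalCharset)) {
            return originalEscaped; // no trip through Unicode at all
        }
        // Different charset: transcoding via Unicode is unavoidable here.
        String text = new String(unescape(originalEscaped), originalCharset);
        return escape(text.getBytes(requested));
    }

    /** Turns %XX sequences back into raw bytes; other chars pass through. */
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                int hi = Character.digit(escaped.charAt(i + 1), 16);
                int lo = Character.digit(escaped.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(c); // assumes unescaped characters are ASCII
        }
        return out.toByteArray();
    }

    /** Simplified escaping: everything but a small ASCII set becomes %XX. */
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean plain = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || "-_.~".indexOf(v) >= 0;
            if (plain) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PreservedEscapedUri u = new PreservedEscapedUri(
                "%a4%a4".toCharArray(), StandardCharsets.ISO_8859_1);
        System.out.println(u.toEscaped(StandardCharsets.ISO_8859_1)); // %a4%a4
        System.out.println(u.toEscaped(StandardCharsets.UTF_8)); // %C2%A4%C2%A4
    }
}
```

Note that the same-charset path keeps even the lowercase hex digits of the input, which a decode-and-re-escape path would not. I used ISO-8859-1 only because every JVM is guaranteed to have it; a Big-5 caller would pass a charset obtained from Charset.forName, if the runtime provides one.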

If this is true, I have a few comments on why this matters...

-- First, for those who don't understand why you can't just convert everything to Unicode and stop worrying, there is some sense behind this. When Unicode was invented, the Far East languages were "unified" into the Han block of Unicode. Some characters that have distinct codes in the native double-byte character sets were mapped to single Unicode characters. This meant that some native character sets wouldn't round-trip to Unicode and back. It was essentially a political compromise -- the Unicode folks needed to save space in the 64k base plane, so they merged Han characters that meant very similar things and looked almost exactly the same. (Emphasis on "similar" and "almost".) But in native charsets that didn't need to have room for Korean and Cyrillic and all the other stuff that's in Unicode, there's room to split out multiple versions of the characters that Unicode merges together.
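A quick way to see whether a given byte sequence survives the trip to Unicode and back is to decode it, re-encode it, and compare the bytes. The sketch below does that; RoundTripCheck and roundTrips are hypothetical names, not anything in URIUtil. It uses US-ASCII with a non-ASCII byte purely as a stand-in for a lossy mapping, because that charset is guaranteed on every JVM. Whether a particular Big-5 or other native sequence fails the check depends on the conversion tables shipped with the runtime, so I make no claim about specific characters here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper for illustration only. */
final class RoundTripCheck {

    /** True when decoding with cs and re-encoding yields the same bytes. */
    static boolean roundTrips(byte[] original, Charset cs) {
        String decoded = new String(original, cs); // bytes -> Unicode
        byte[] reencoded = decoded.getBytes(cs);   // Unicode -> bytes
        return Arrays.equals(original, reencoded);
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9};

        // ISO-8859-1 maps every byte to its own code point, so nothing is lost.
        System.out.println(roundTrips(raw, StandardCharsets.ISO_8859_1)); // true

        // US-ASCII cannot represent 0xE9; it decodes to U+FFFD and comes
        // back as '?', so the original byte is gone.
        System.out.println(roundTrips(raw, StandardCharsets.US_ASCII)); // false
    }
}
```

When a check like this fails for the charset the URI was built in, the only way to return the exact original is to have kept the original escaped string around.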

-- There are also a few new character sets like JIS-212 that contain characters (like Japanese dental symbols, believe it or not) that haven't been encoded in Unicode yet. Presumably we'd want to keep the encoded URI string around so that we can preserve this kind of character.

(In a past life I managed the Unicode group at IBM, and I remember far more of this stuff than I thought I did.)

A few comments on URI.java and URIUtil.java

-- I think the comments need to be greatly improved. It's very hard to figure out what many of the methods do. In the cases where I can figure out what they do, it's hard to figure out *why*.
-- It would be nice if the documentation explained the charset concepts: what a document charset is, what a protocol charset is, and so on. A reference to the RFC is nice, but a more concise explanation in the JavaDoc would be better.

Laura, hoping I helped answer part of the "why" here, at least
