Laura,

Finally, there's someone who can read Sung-Gu's mind!
All right. A simple phrase "There are charsets that are not adequately
represented in Unicode" by Sung-Gu would have put the discussion into a
completely different perspective. And of course, Sung-Gu's stoical
refusal to provide a test case for the method did not help either.

Many thanks

Oleg

On Tue, 2003-02-04 at 22:51, Laura Werner wrote:
> Hi Sung-Gu,
>
> >Actually, that's very easy...
> >And not that important unless it's not going to be support multilingual.
> >
> >As you see the diagram, bytes informations created from the original charset
> >should be restored. That's all.
> >
>
> My understanding of what you're saying is that if someone constructs a
> URI using escaped characters in a particular charset (e.g. Big-5), using
> the URI(char[] escaped) constructor, then URI needs to preserve those
> characters. If someone asks for the URI back as an escaped string in
> the original charset (e.g. Big-5 again), we need to give them the
> *exact* original string; it's not good enough to transcode from the
> escaped Big-5 string to Unicode and back to Big-5. Is this correct?
>
> If this is true, I have a few comments on why this matters...
>
> -- First, for those who don't understand why you can't just convert
> everything to Unicode and stop worrying, there is some sense behind
> this. When Unicode was invented, the far-east languages were "Unified"
> into the Han block of Unicode. Some characters that have distinct codes
> in the native double-byte character sets were mapped to single Unicode
> characters. This meant that some native character sets wouldn't round
> trip to Unicode and back. It was essentially a political compromise --
> the Unicode folks needed to save space in the 64k base plane, so they
> merged Han characters that meant very similar things and looked almost
> exactly the same. (Emphasis on "similar" and "almost".) But in native
> charsets that didn't need to have room for Korean and Cyrillic and all
> the other stuff that's in Unicode, there's room to split out multiple
> versions of these characters that are merged together.
>
> -- There are also a few new character sets like JIS-212 that contain
> characters (like Japanese dental symbols, believe it or not) that
> haven't been encoded in Unicode yet. Presumably we'd want to keep the
> encoded URI string around so that we can preserve this kind of character.
>
> (In a past life I managed the Unicode group at IBM, and I remember far
> more of this stuff than I thought I did.)
>
> A few comments on URI.java and URIUtil.java
>
> -- I think the comments need to be greatly improved. It's very hard to
> figure out what many of the methods do. In the cases where I can figure
> out what they do, it's hard to figure out *why*.
>
> -- It would be nice if the documentation explained the charset concepts:
> what a document charset is, what a protocol charset is, and so on. A
> reference to the RFC is nice, but a more concise explanation in the
> JavaDoc would be better.
>
> Laura, hoping I helped answer part of the "why" here, at least

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
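
For anyone who wants to poke at the round-trip question discussed in this thread, here is a minimal sketch of such a check. It is not code from URI.java or URIUtil.java; the class RoundTripSketch and both of its methods are hypothetical names made up for illustration. It also uses ISO-8859-1 and US-ASCII only because every JVM is guaranteed to ship them; a real test would presumably use Big5 or another native double-byte charset, assuming the runtime provides it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, for illustration only (not part of HttpClient).
class RoundTripSketch {

    /** Turns a %XX-escaped ASCII string into the raw bytes it stands for. */
    static byte[] unescape(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%' && i + 2 < escaped.length()) {
                // Two hex digits follow the percent sign.
                out.write(Integer.parseInt(escaped.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    /**
     * Decodes the bytes to a Java (Unicode) String and encodes them again
     * with the same charset. Returns true only if the bytes come back
     * unchanged, i.e. nothing was lost on the way through Unicode.
     */
    static boolean survivesRoundTrip(byte[] original, Charset cs) {
        String decoded = new String(original, cs);
        return Arrays.equals(original, decoded.getBytes(cs));
    }

    public static void main(String[] args) {
        byte[] raw = unescape("caf%E9");
        // 0xE9 is a valid ISO-8859-1 byte, so the bytes are preserved.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.ISO_8859_1));
        // 0xE9 is not valid US-ASCII; it is replaced during decoding,
        // so the original byte cannot be recovered.
        System.out.println(survivesRoundTrip(raw, StandardCharsets.US_ASCII));
    }
}
```

When the check fails for a given charset, the only way to hand back the exact original escaped string is to keep that string (or its bytes) around instead of regenerating it from the decoded Unicode form, which is the behavior described in the quoted message.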
