Laura

Finally, there's someone who can read Sung-Gu's mind! 

All right. A simple statement from Sung-Gu, such as "There are charsets
that are not adequately represented in Unicode", would have put the
discussion in a completely different perspective. And of course,
Sung-Gu's stoic refusal to provide a test case for the method did not
help either.

Many thanks

Oleg 



On Tue, 2003-02-04 at 22:51, Laura Werner wrote:
> Hi Sung-Gu,
> 
> >Actually, that's very easy...
> >And it's not that important unless it's going to support multilingual use.
> >
> >As you can see in the diagram, the byte information created from the
> >original charset should be restored.  That's all.
> >  
> >
> My understanding of what you're saying is that if someone constructs a 
> URI using escaped characters in a particular charset (e.g. Big-5), using 
> the URI(char[] escaped) constructor, then URI needs to preserve those 
> characters.  If someone asks for the URI back as an escaped string in 
> the original charset (e.g. Big-5 again), we need to give them the 
> *exact* original string; it's not good enough to transcode from the 
> escaped Big-5 string to Unicode and back to Big-5.  Is this correct?
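
A minimal sketch of the behaviour described above, assuming a hypothetical EscapedUriSketch class (it is not the actual URI.java, and its method names are made up for illustration). It stores the escaped string exactly as supplied and only goes through Unicode when a different charset is requested. The escaping rules are deliberately simplified.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical holder, not the HttpClient URI class: it keeps the escaped
// form exactly as supplied, plus the charset it was escaped in.
final class EscapedUriSketch {
    private final String originalEscaped;
    private final Charset originalCharset;

    EscapedUriSketch(char[] escaped, Charset charset) {
        this.originalEscaped = new String(escaped);
        this.originalCharset = charset;
    }

    // Same charset: return the stored string untouched (no trip through Unicode).
    // Different charset: decode to Unicode and re-escape, which may be lossy.
    String getEscaped(Charset target) {
        if (target.equals(originalCharset)) {
            return originalEscaped;
        }
        String decoded = new String(unescape(originalEscaped), originalCharset);
        return escape(decoded.getBytes(target));
    }

    // Turn %XX sequences back into raw bytes; malformed escapes throw.
    static byte[] unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(s.substring(i + 1, i + 3), 16));
                i += 2;
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    // Simplified escaping: letters, digits and "-._~/:" pass through,
    // every other byte becomes %XX with upper-case hex digits.
    static String escape(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9') || "-._~/:".indexOf(v) >= 0;
            if (safe) {
                sb.append((char) v);
            } else {
                sb.append(String.format("%%%02X", v));
            }
        }
        return sb.toString();
    }
}
```

Asking for the original charset returns the input verbatim, down to the case of the hex digits. Asking for another charset is the only path that transcodes.
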
> 
> If this is true, I have a few comments on why this matters...
> 
> -- First, for those who don't understand why you can't just convert 
> everything to Unicode and stop worrying, there is some sense behind 
> this.  When Unicode was invented, the far-east languages were "Unified" 
> into the Han block of Unicode.  Some characters that have distinct codes 
> in the native double-byte character sets were mapped to single Unicode 
> characters.  This meant that some native character sets wouldn't round 
> trip to Unicode and back.  It was essentially a political compromise -- 
> the Unicode folks needed to save space in the 64k base plane, so they 
> merged Han characters that meant very similar things and looked almost 
> exactly the same.  (Emphasis on "similar" and "almost".)  But in native 
> charsets that didn't need to have room for Korean and Cyrillic and all 
> the other stuff that's in Unicode, there's room to split out multiple 
> versions of these characters that are merged together.
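
One way to check whether a given byte sequence survives the trip through Unicode is sketched below. RoundTripCheck and survivesUnicode are hypothetical names. Which Big-5 or other double-byte sequences actually fail depends on the mapping tables shipped with the Java runtime, so no specific failing character is claimed here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper for illustration only.
final class RoundTripCheck {
    // True if decoding the bytes to Unicode and encoding them again in the
    // same charset reproduces the original bytes exactly.
    static boolean survivesUnicode(byte[] original, Charset charset) {
        String viaUnicode = new String(original, charset);
        return Arrays.equals(original, viaUnicode.getBytes(charset));
    }
}
```

If the runtime provides the charset, a call such as RoundTripCheck.survivesUnicode(bytes, Charset.forName("Big5")) would flag sequences that need the original bytes kept around.
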
> 
> -- There are also a few new character sets like JIS X 0212 that contain 
> characters (like Japanese dental symbols, believe it or not) that 
> haven't been encoded in Unicode yet.  Presumably we'd want to keep the 
> encoded URI string around so that we can preserve this kind of character.
> 
> (In a past life I managed the Unicode group at IBM, and I remember far 
> more of this stuff than I thought I did.)
> 
> A few comments on URI.java and URIUtil.java
> 
> -- I think the comments need to be greatly improved.  It's very hard to 
> figure out what many of the methods do.  In the cases where I can figure 
> out what they do, it's hard to figure out *why*. 
> 
> -- It would be nice if the documentation explained the charset concepts: 
> What is a document charset and a protocol charset and so on.  A 
> reference to the RFC is nice, but a more concise explanation in the 
> JavaDoc would be better.
> 
> Laura, hoping I helped answer part of the "why" here, at least
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

