Oleg Kalnichevski wrote:
> I apologize for restarting this conversation, but I have to confess I found myself not intelligent enough to grasp the grand design of the URIUtil#toUsingCharset method.

Not a problem. And it's not intelligence: a) URI and URIUtil are not well documented, and b) character sets are a very messy area. Life would be a lot easier if everyone just switched to Unicode, but there's a lot of resistance (IMHO mostly political) to doing so.
> Ok. If I understand you right, you are saying that there are charsets that are inadequately represented in Unicode, or not represented at all.

Yes. This explains why URI preserves the original string you pass in, with the escape sequences in it. For example, if someone passes in a URI with % escapes in it and the URI charset is JIS, you'd want to provide a way of accessing the original escaped string, since that's probably what you'd want to pass to the web server.
> Absolutely fine with me. So, URIUtil#toUsingCharset is supposedly needed to help preserve those characters when performing charset translations.

I've never been able to figure out *what* most of the charset methods in URIUtil are supposed to do, actually. The bit that confuses me is the same thing that's confusing you, I think...
> return new String(target.getBytes(fromCharset), toCharset);

That's the crux of the matter right there. target.getBytes(fromCharset) asks the original "target" Unicode String (presumably containing % escapes) to convert itself to its byte representation in the original charset. Then new String(..., toCharset) creates a new Unicode string while pretending those very same bytes are in "toCharset", which is presumably a different charset. Any character that has different encodings in the two character sets will end up changed in the second string, because the bytes are written into the byte array using one character set and then interpreted using another. And since some character-set encodings are stateful, it's conceivable that certain "fromCharset"/"toCharset" pairs could even make the new String construction blow up, because the byte array is invalid input for the toCharset decoder.
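To make the corruption concrete, here is a minimal, self-contained sketch of that one-liner in isolation (the class name and the standalone method are mine for illustration, not URIUtil's actual code):

```java
import java.io.UnsupportedEncodingException;

public class CharsetRoundTrip {

    // The pattern under discussion: encode a String with one charset,
    // then decode the very same bytes as if they were in another.
    static String toUsingCharset(String target, String fromCharset, String toCharset)
            throws UnsupportedEncodingException {
        return new String(target.getBytes(fromCharset), toCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "é" is the single byte 0xE9 in ISO-8859-1; as UTF-8 input that
        // lone byte is malformed, so the decoder substitutes U+FFFD.
        System.out.println(toUsingCharset("\u00e9", "ISO-8859-1", "UTF-8"));

        // The other direction: "é" in UTF-8 is the two bytes 0xC3 0xA9,
        // which ISO-8859-1 happily decodes as TWO characters: "Ã©".
        System.out.println(toUsingCharset("\u00e9", "UTF-8", "ISO-8859-1"));
    }
}
```

Note that the String(byte[], String) constructor silently replaces malformed input rather than throwing, so the first case degrades quietly to a replacement character instead of blowing up; a stateful encoding could still fail more loudly through a CharsetDecoder configured to report errors.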
The part I'm having trouble with is *why* you'd want to do this. The whole point of Unicode (or one of them) is so that you don't have to remember what charset your byte arrays are in. Once you convert from a String to a byte array, you need to preserve the charset of that byte array. Suddenly pretending it's in a different charset is just going to screw things up.
I think I need to go read RFCs 1738 and 1808 and see if they're at all enlightening on this subject.
-- Laura
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
