Oleg Kalnichevski wrote:
> I apologize for restarting this conversation, but I have to confess I found myself not intelligent enough to grasp the grand design of the URIUtil#toUsingCharset method.

Not a problem. And it's not intelligence: a) URI and URIUtil are not well documented, and b) character sets are a very messy area. Life would be a lot easier if everyone just switched to Unicode, but there's a lot of resistance (IMHO mostly political) to doing so.
> Ok. If I understand you right, you are saying that there are charsets that are inadequately represented in Unicode, or not represented at all.

Yes. This explains why URI preserves the original string you pass in, with the escape sequences in it. For example, if someone passes in a URI with % escapes in it and the URI charset is JIS, you'd want to provide a way of accessing the original escaped string, since that's probably what you'd want to pass to the web server.
> Absolutely fine with me. So, URIUtil#toUsingCharset is supposedly needed to help preserve those characters when performing charset translations.

I've never been able to figure out *what* most of the charset methods in URIUtil are supposed to do, actually. The bit that confuses me is the same thing that's confusing you, I think...
> return new String(target.getBytes(fromCharset), toCharset);

That's the crux of the matter right there. target.getBytes(fromCharset) asks the original "target" Unicode String (presumably containing % escapes) to convert itself to its byte representation in the original charset. Then new String(..., toCharset) creates a new Unicode string while pretending those very same bytes are in "toCharset", which is presumably a different charset. Any character that has different encodings in the two character sets will end up changed in the second string, because the bytes are written into the byte array using one character set and then interpreted using another. And since some character-set encodings are stateful, it's conceivable that certain "fromCharset"/"toCharset" pairs could even make the new String construction blow up, because the byte array is invalid input for the toCharset decoder.
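To make the corruption concrete, here is a minimal, self-contained sketch of that one-liner in isolation (the class name and the standalone method are mine for illustration, not URIUtil's actual code):

```java
import java.io.UnsupportedEncodingException;

public class CharsetRoundTrip {

    // The pattern under discussion: encode a String with one charset,
    // then decode the very same bytes as if they were in another.
    static String toUsingCharset(String target, String fromCharset, String toCharset)
            throws UnsupportedEncodingException {
        return new String(target.getBytes(fromCharset), toCharset);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // "é" is the single byte 0xE9 in ISO-8859-1; as UTF-8 input that
        // lone byte is malformed, so the decoder substitutes U+FFFD.
        System.out.println(toUsingCharset("\u00e9", "ISO-8859-1", "UTF-8"));

        // The other direction: "é" in UTF-8 is the two bytes 0xC3 0xA9,
        // which ISO-8859-1 happily decodes as TWO characters: "Ã©".
        System.out.println(toUsingCharset("\u00e9", "UTF-8", "ISO-8859-1"));
    }
}
```

Note that the String(byte[], String) constructor silently replaces malformed input rather than throwing, so the first case degrades quietly to a replacement character instead of blowing up; a stateful encoding could still fail more loudly through a CharsetDecoder configured to report errors.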
The part I'm having trouble with is *why* you'd want to do this. The whole point of Unicode (or one of them) is so that you don't have to remember what charset your byte arrays are in. Once you convert from a String to a byte array, you need to preserve the charset of that byte array. Suddenly pretending it's in a different charset is just going to screw things up.
I think I need to go read RFCs 1738 and 1808 and see if they're at all enlightening on this subject.
-- Laura
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
