On Oct 06, 2006, at 17:15 UTC, Toby Rush wrote:

> Thanks to Jay and Jonathan for their responses. It appears that I've
> discovered a bug with decodeURLComponent, though I doubt myself a
> little because I can't believe I'm the first one to run into it!

I don't think the problem is what you think it is.

> In my initial message, I noted that entering the following string
> into a web form:
>
> Mü! (second letter is umlauted 'u')
>
> Causes the web browser to encode it as follows:
>
> M%FC%21
>
> This encoding is UTF-8 (http://www.utf8-chartable.de/)

No it isn't. Keep in mind that URL encoding is a bit like EncodeBase64; it simply encodes the bytes of the string, and says nothing about how to interpret these bytes as text. So what you have here, for the second character, is a single byte &hFC. That's not valid UTF-8 text, which has no single-byte characters greater than &h7F. &hFC may be the right Unicode code point for this character, but that's a coincidence; the actual encoding here is ISO-Latin-1. (Unicode uses the same code points as ISO-Latin-1 over the range of the latter.)

And this is not surprising -- ISO-Latin-1 is still sort of the "default" encoding on the web, when not specified as something else (which can be done in various convoluted ways).

> Running this string through decodeURLComponent, however gives us this:

You have to tell DecodeURLComponent what encoding to expect. One could argue that maybe it should assume ISO-Latin-1 by default, but it currently doesn't.

> It appears that decodeURLComponent is assuming that the %xx values
> are according to MacRoman (or SystemDefault) instead of UTF-8, which
> appears to be the way that URLs are encoded nowadays.

No, URLs are usually encoded as ISO-Latin-1. But I agree, MacRoman (or SystemDefault) is probably what DecodeURLComponent assumes if you don't specify otherwise.

> So should I submit this as a bug?

Nope, no bug here.

> Jon, your solution:
>
> > Try using DecodeURL by passing in the encoding parameter:
> >
> > s = DecodeURL( myString, Encodings.UTF8 )

This would be incorrect, because it's not UTF-8. Change that to Encodings.ISOLatin1, and it'll work fine.

> Shouldn't decodeURLComponent do this, or at
> least have a setting to indicate how the %xx entities are encoded?

Indeed it does.
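
To make the byte-level point concrete, here is a minimal sketch. It is written in Python rather than REALbasic, so the urllib.parse calls are an assumption of this illustration and not part of the thread. It only demonstrates the idea that percent-decoding yields bytes, and that the text encoding has to be chosen separately:

```python
from urllib.parse import quote, unquote, unquote_to_bytes

encoded = "M%FC%21"  # the string the browser sent

# Percent-decoding by itself only recovers bytes, not text.
raw = unquote_to_bytes(encoded)
print(raw)  # b'M\xfc!'

# Interpreted as ISO-Latin-1, the single byte 0xFC is u-umlaut (U+00FC).
as_latin1 = raw.decode("iso-8859-1")

# Interpreted as UTF-8, a lone 0xFC byte is not valid text.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding and interpreting in one step, with the encoding stated explicitly.
one_step = unquote(encoded, encoding="iso-8859-1")

# If the form had really been submitted as UTF-8, the u-umlaut would have
# arrived as two percent-encoded bytes instead of one.
utf8_form = quote("M\u00fc!", encoding="utf-8")
print(utf8_form)  # M%C3%BC%21
```

The REALbasic equivalent of the one-step call is the corrected line quoted above, with Encodings.ISOLatin1 passed as the encoding.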

Cheers,
- Joe

--
Joe Strout -- [EMAIL PROTECTED]
Verified Express, LLC     "Making the Internet a Better Place"
http://www.verex.com/
