On Oct 06, 2006, at 17:15 UTC, Toby Rush wrote:

> Thanks to Jay and Jonathan for their responses. It appears that I've  
> discovered a bug with decodeURLComponent, though I doubt myself a  
> little because I can't believe I'm the first one to run into it!

I don't think the problem is what you think it is.

> In my initial message, I noted that entering the following string
> into a web form:
> 
> Mü! (second letter is umlauted 'u')
> 
> Causes the web browser to encode it as follows:
> 
> M%FC%21
> 
> This encoding is UTF-8 (http://www.utf8-chartable.de/)

No it isn't.  Keep in mind that URL encoding is a bit like
EncodeBase64; it simply encodes the bytes of the string, and says
nothing about how to interpret these bytes as text.  So what you have
here, for the second character, is a single byte &hFC.  That's not
valid UTF-8 text, which has no single-byte characters greater than &h7F.
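
To make the bytes-versus-text point concrete, here is a minimal sketch in Python rather than REALbasic. The standard-library `urllib.parse.unquote_to_bytes` stands in for the percent-decoding step, and the helper `is_valid_utf8` is a made-up name for this illustration. It decodes the string to raw bytes and then checks whether those bytes are valid UTF-8:

```python
from urllib.parse import unquote_to_bytes

# Percent-decoding yields raw bytes; it says nothing about the text encoding.
raw = unquote_to_bytes("M%FC%21")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print([hex(b) for b in raw])  # the individual byte values
print(is_valid_utf8(raw))     # a lone byte above 0x7F is not valid UTF-8
```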

That may be the right Unicode code point (U+00FC), but that's not the
same thing as UTF-8; the actual encoding here is ISO-Latin-1.  (Unicode
uses the same code points as ISO-Latin-1 over the range of the latter,
but UTF-8 encodes everything above &h7F as two or more bytes.)  And this is not
surprising -- that's still sort of the "default" encoding on the web,
when not specified as something else (which can be done in various
convoluted ways).
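
For comparison, this is roughly what the two encodings look like on the wire. Again a Python sketch, not REALbasic; `urllib.parse.quote` is only standing in for whatever the browser does, and the variable names are assumptions for the example:

```python
from urllib.parse import quote

text = "M\u00fc!"  # "M", u-umlaut (U+00FC), "!"

# Same text, percent-encoded from two different byte encodings.
latin1_form = quote(text, encoding="latin-1")  # one byte for the umlaut
utf8_form = quote(text, encoding="utf-8")      # two bytes for the umlaut

print(latin1_form)
print(utf8_form)
print(hex(ord("\u00fc")))  # the code point matches the Latin-1 byte value
```

If the browser had really sent UTF-8, the umlaut would show up as two %xx escapes instead of the single %FC in the original message.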

> Running this string through decodeURLComponent, however, gives us this:

You have to tell DecodeURLComponent what encoding to expect.  One could
argue that maybe it should assume ISO-Latin-1 by default, but it
currently doesn't.

> It appears that decodeURLComponent is assuming that the %xx values  
> are according to MacRoman (or SystemDefault) instead of UTF-8, which  
> appears to be the way that URLs are encoded nowadays.

No, URLs are usually encoded as ISO-Latin-1.  But I agree, that's
probably what DecodeURLComponent is doing if you don't specify
otherwise.
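
As a rough illustration of why a MacRoman default gives the wrong character, here is a Python sketch (not REALbasic, and only a guess at what an unspecified-encoding decode amounts to). It interprets the same byte under Python's `mac_roman` and `latin-1` codecs:

```python
# The single byte the browser sent for the second character.
byte = bytes([0xFC])

as_macroman = byte.decode("mac_roman")  # what a MacRoman default would produce
as_latin1 = byte.decode("latin-1")      # what the browser actually meant

print(repr(as_macroman))
print(repr(as_latin1))
print(as_macroman == as_latin1)  # the two interpretations disagree
```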

> So should I submit this as a bug?

Nope, no bug here.

> Jon, your solution:
> 
> > Try using DecodeURL by passing in the encoding parameter:
> >
> > s = DecodeURL( myString, Encodings.UTF8 )

This would be incorrect, because it's not UTF-8.  Change that to
Encodings.ISOLatin1, and it'll work fine.
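
The same experiment can be sketched in Python, with `urllib.parse.unquote` and its `encoding` argument playing the role of the encoding parameter above. This is an analogy under that assumption, not the REALbasic call itself:

```python
from urllib.parse import unquote

encoded = "M%FC%21"

# Telling the decoder the bytes are UTF-8 fails for the umlaut; unquote's
# default errors="replace" substitutes U+FFFD for the invalid byte.
as_utf8 = unquote(encoded, encoding="utf-8")

# Telling it the bytes are ISO-Latin-1 recovers the original text.
as_latin1 = unquote(encoded, encoding="latin-1")

print(repr(as_utf8))
print(repr(as_latin1))
```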

> Shouldn't decodeURLComponent do this, or at  
> least have a setting to indicate how the %xx entities are encoded?

Indeed it does.

Cheers,
- Joe

--
Joe Strout -- [EMAIL PROTECTED]
Verified Express, LLC     "Making the Internet a Better Place"
http://www.verex.com/
