Hi Ander,

On Friday, 3 April 2015, 12:26:09, Ander Juaristi wrote:
> On 03/13/2015 11:48 PM, Adam Sampson wrote:
> > Hi,
> >
> > I've just found a case where wget 1.16.3 responds to a 302 redirect
> > differently depending on whether it's in an ASCII or UTF-8 locale.
> >
> > This works:
> > LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >
> > This doesn't work:
> > LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >
> > I've attached logs with -d showing what's actually going on. The
> > initial request gives a 302 response with a Location: that contains:
> > ....tar.bz2?Signature=up6%2BtTpSF...
> >
> > In the UTF-8 locale, wget correctly redirects to that location.
> >
> > In the ASCII locale, wget -d prints a "converted: '...' -> '...'" line
> > (from iri.c's do_conversion), then redirects to:
> > ....tar.bz2?Signature=up6+tTpSF...
> >
> > (If you try it yourself you'll get a slightly different URL, but at
> > least for me it usually contains %2B somewhere.)
> >
> > This appears to be because do_conversion calls url_unescape on the
> > input string it's given -- even though that input string is a _const_
> > char * in the code that calls it (main -> retrieve_url -> url_parse ->
> > remote_to_utf8 -> do_conversion). It's not immediately obvious to me
> > whether that's intentional or not; at the very least, it's a surprising
> > bit of behaviour.
>
> That call to url_unescape() is necessary because iconv() needs the multibyte
> characters with no encoding. My first approach, by the way, was to remove
> that call, but that caused Test-iri-percent.px to fail, which is pretty
> clear.
>
> The issue seems to be at the call to reencode_escapes(), just after
> remote_to_utf8() returns. The problem here is that %2B resolves to a
> literal "+". That character is identical to the reserved character "+",
> so reencode_escapes() treats it as a reserved character and leaves it
> as-is. The same happens with other characters, such as "=" (%3D).
>
> What I propose is to tag the characters that have been decoded, in
> url_unescape(), and then in reencode_escapes(), verify if they coincide
> with reserved characters as well.
>
> What do you guys think?

Without looking at the code right now, and going only by what you describe
above, your proposal sounds like a good idea.

This problem pops up again and again. If you solve the issue, some people
will love you :-)

Regards, Tim
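
For reference, the tagging idea discussed above can be modelled roughly as follows. This is only a minimal Python sketch, not wget's C code: unescape_tagged(), reencode_naive() and reencode_tagged() are hypothetical names, the RESERVED and UNRESERVED sets are an assumption loosely based on RFC 2396, only ASCII escapes are handled, and the URL is a made-up short value modelled on the one in the report.

```python
import string

# Assumed character classes (loosely RFC 2396); wget's own tables may differ.
RESERVED = set(";/?:@&=+$,")
UNRESERVED = set(string.ascii_letters + string.digits + "-_.!~*'()")


def unescape_tagged(s):
    """Decode %XX escapes, remembering which characters came from an escape.

    Returns a list of (char, was_escaped) pairs. ASCII-only simplification.
    """
    pairs = []
    i = 0
    while i < len(s):
        hex_part = s[i + 1:i + 3]
        is_escape = (
            s[i] == "%"
            and len(hex_part) == 2
            and all(c in string.hexdigits for c in hex_part)
        )
        if is_escape:
            pairs.append((chr(int(hex_part, 16)), True))
            i += 3
        else:
            pairs.append((s[i], False))
            i += 1
    return pairs


def reencode_naive(pairs):
    """Model of the reported behaviour: the tag is ignored, so a decoded
    '+' is indistinguishable from a reserved '+' and stays literal."""
    return "".join(
        c if c in RESERVED or c in UNRESERVED else f"%{ord(c):02X}"
        for c, _ in pairs
    )


def reencode_tagged(pairs):
    """Model of the proposal: a reserved character that arrived as an
    escape is escaped again; one that arrived literally is left alone."""
    out = []
    for c, was_escaped in pairs:
        if c in UNRESERVED or (c in RESERVED and not was_escaped):
            out.append(c)
        else:
            out.append(f"%{ord(c):02X}")
    return "".join(out)


if __name__ == "__main__":
    url = "a.tar.bz2?Signature=up6%2BtTpSF"  # hypothetical shortened URL
    print(reencode_naive(unescape_tagged(url)))   # '+' ends up literal
    print(reencode_tagged(unescape_tagged(url)))  # %2B survives
```

In this model the literal "=" after "Signature" stays as-is because it was never escaped, while the "+" that came from %2B is escaped again, so the query string keeps its original meaning.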
