Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

Tim Ruehsen Mon, 20 Apr 2015 08:00:19 -0700

Hi Ander,

Sorry for the late reply. I was waiting for the test case you mentioned... ;-)


Could you please fix this warning (add a prototype):
iri.c: In function 'do_conversion':
iri.c:139:3: warning: implicit declaration of function 'url_unescape_except_reserved' [-Wimplicit-function-declaration]
   url_unescape_except_reserved (in);
   ^
url.c:217:1: warning: no previous prototype for 'url_unescape_except_reserved' [-Wmissing-prototypes]
 url_unescape_except_reserved (char *s)
 ^

Regards, Tim

On Monday 13 April 2015 17:03:23 Ander Juaristi wrote:
> On 04/03/2015 02:16 PM, Tim Rühsen wrote:
> > Hi Ander,
> > 
> > Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi:
> >> On 03/13/2015 11:48 PM, Adam Sampson wrote:
> >>> Hi,
> >>> 
> >>> I've just found a case where wget 1.16.3 responds to a 302 redirect
> >>> differently depending on whether it's in an ASCII or UTF-8 locale.
> >>> 
> >>> This works:
> >>> LC_ALL=en_GB.UTF-8 wget
> >>> https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >>> 
> >>> This doesn't work:
> >>> LC_ALL=C wget
> >>> https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >>> 
> >>> I've attached logs with -d showing what's actually going on. The
> >>> initial request gives a 302 response with a Location: that contains:
> >>>     ....tar.bz2?Signature=up6%2BtTpSF...
> >>> 
> >>> In the UTF-8 locale, wget correctly redirects to that location.
> >>> 
> >>> In the ASCII locale, wget -d prints a "converted: '...' -> '...'" line
> >>> (from iri.c's do_conversion), then redirects to:
> >>>     ....tar.bz2?Signature=up6+tTpSF...
> >>> 
> >>> (If you try it yourself you'll get a slightly different URL, but at
> >>> least for me it usually contains %2B somewhere.)
> >>> 
> >>> This appears to be because do_conversion calls url_unescape on the
> >>> input string it's given -- even though that input string is a _const_
> >>> char * in the code that calls it (main -> retrieve_url -> url_parse ->
> >>> remote_to_utf8 -> do_conversion). It's not immediately obvious to me
> >>> whether that's intentional or not; at the very least, it's a surprising
> >>> bit of behaviour.
> >> 
> >> That call to url_unescape() is necessary because iconv() needs the
> >> multibyte characters without percent-encoding. My first approach, by the
> >> way, was to remove that call, but that caused Test-iri-percent.px to fail,
> >> which is pretty clear.
> >> 
> >> The issue seems to be in the call to reencode_escapes(), just after
> >> remote_to_utf8() returns. By then %2B has already been decoded to a
> >> literal "+". That is also a reserved character, so reencode_escapes()
> >> treats it as reserved and leaves it as-is.
> >> The same happens with other reserved characters, such as "=" (%3D).
> >> 
> >> What I propose is to tag the characters that have been decoded, in
> >> url_unescape(), and then in reencode_escapes(), verify if they coincide
> >> with reserved characters as well.
> >> 
> >> What do you guys think?
> > 
> > Without looking at the code right now and from what you describe above,
> > your proposal sounds like a good idea. This problem pops up again and
> > again. If you solve the issue, some people will love you :-)
> > 
> > Regards, Tim
> 
> As promised, here it goes.
> 
> This works for me, although I still plan to send a test case in the
> next few days.
> 
> I read RFC 3987, on which iri.c is based, and it proposes a better approach
> than mine for this specific case, namely in section 3.2 "Converting
> URIs to IRIs". Thus, I decided to implement that approach, which basically
> says that characters in "reserved" should *not* be unescaped prior to
> converting to UTF-8.
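
To illustrate the approach quoted above, here is a minimal sketch in C. It is
not the patch from this thread: sketch_unescape and its keep_reserved flag are
hypothetical names, the reserved set is the one from RFC 3986, and keeping
"%25" escaped as well is a choice made by the sketch so that the result stays
unambiguous.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* RFC 3986 "reserved" set: gen-delims plus sub-delims. */
static int
is_reserved (unsigned char c)
{
  return c != '\0' && strchr (":/?#[]@!$&'()*+,;=", c) != NULL;
}

/* Value of one hex digit; the caller has already checked isxdigit(). */
static int
hex_value (unsigned char c)
{
  if (isdigit (c))
    return c - '0';
  return tolower (c) - 'a' + 10;
}

/* Hypothetical helper, not wget's code: decode %XX sequences in place.
   With keep_reserved != 0, escapes that stand for a reserved character
   or for '%' itself are copied through untouched.  An escaped NUL is
   never decoded, since it would truncate the string.  */
void
sketch_unescape (char *s, int keep_reserved)
{
  char *in = s, *out = s;

  assert (s != NULL);
  while (*in)
    {
      if (in[0] == '%'
          && isxdigit ((unsigned char) in[1])
          && isxdigit ((unsigned char) in[2]))
        {
          unsigned char c = (unsigned char)
            (hex_value ((unsigned char) in[1]) * 16
             + hex_value ((unsigned char) in[2]));
          int keep = c == '\0'
            || (keep_reserved && (c == '%' || is_reserved (c)));

          if (!keep)
            {
              *out++ = (char) c;
              in += 3;
              continue;
            }
        }
      *out++ = *in++;   /* literal byte, or first byte of a kept escape */
    }
  *out = '\0';
}
```

Called with keep_reserved set to 0 on a string such as
"f.tar.bz2?Signature=up6%2BtTpSF" (a made-up stand-in for the elided URL),
the escape becomes a literal "+", and a later re-encoding pass has no way to
know it was ever escaped. Called with keep_reserved set to 1, the "%2B" is
left alone, while non-reserved escapes such as the UTF-8 bytes "%C3%A9" are
still decoded for the charset conversion.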
