Hi Ander, sorry to answer so late. I was waiting for the mentioned test case... ;-)
Could you please fix this warning (add a prototype):

iri.c: In function 'do_conversion':
iri.c:139:3: warning: implicit declaration of function 'url_unescape_except_reserved' [-Wimplicit-function-declaration]
   url_unescape_except_reserved (in);
   ^
url.c:217:1: warning: no previous prototype for 'url_unescape_except_reserved' [-Wmissing-prototypes]
 url_unescape_except_reserved (char *s)
 ^

Regards, Tim

On Monday 13 April 2015 17:03:23 Ander Juaristi wrote:
> On 04/03/2015 02:16 PM, Tim Rühsen wrote:
> > Hi Ander,
> >
> > On Friday, 3 April 2015, 12:26:09 Ander Juaristi wrote:
> >> On 03/13/2015 11:48 PM, Adam Sampson wrote:
> >>> Hi,
> >>>
> >>> I've just found a case where wget 1.16.3 responds to a 302 redirect
> >>> differently depending on whether it's in an ASCII or UTF-8 locale.
> >>>
> >>> This works:
> >>> LC_ALL=en_GB.UTF-8 wget
> >>> https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >>>
> >>> This doesn't work:
> >>> LC_ALL=C wget
> >>> https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
> >>>
> >>> I've attached logs with -d showing what's actually going on. The
> >>> initial request gives a 302 response with a Location: that
> >>> contains:
> >>> ....tar.bz2?Signature=up6%2BtTpSF...
> >>>
> >>> In the UTF-8 locale, wget correctly redirects to that location.
> >>>
> >>> In the ASCII locale, wget -d prints a "converted: '...' -> '...'"
> >>> line (from iri.c's do_conversion), then redirects to:
> >>> ....tar.bz2?Signature=up6+tTpSF...
> >>>
> >>> (If you try it yourself you'll get a slightly different URL, but at
> >>> least for me it usually contains %2B somewhere.)
> >>>
> >>> This appears to be because do_conversion calls url_unescape on the
> >>> input string it's given -- even though that input string is a _const_
> >>> char * in the code that calls it (main -> retrieve_url -> url_parse ->
> >>> remote_to_utf8 -> do_conversion). It's not immediately obvious to me
> >>> whether that's intentional or not; at the very least, it's a surprising
> >>> bit of behaviour.
> >>
> >> That call to url_unescape() is necessary because iconv() needs the
> >> multibyte characters with no encoding. My first approach, by the way,
> >> was to remove that call, but that caused Test-iri-percent.px to fail,
> >> which is pretty clear.
> >>
> >> The issue seems to be at the call to reencode_escapes(), just after
> >> remote_to_utf8() returns. The problem here is that %2B resolves to "+"
> >> (literal). And that character is equal to the reserved character "+", and
> >> reencode_escapes() treats it as a reserved character and leaves it
> >> as-is.
> >> The same happens with other characters, such as "=" (%3D).
> >>
> >> What I propose is to tag the characters that have been decoded, in
> >> url_unescape(), and then in reencode_escapes(), verify if they coincide
> >> with reserved characters as well.
> >>
> >> What do you guys think?
> >
> > Without looking at the code right now and from what you describe above,
> > your proposal sounds like a good idea. This problem pops up again and
> > again. If you solve the issue, some people will love you :-)
> >
> > Regards, Tim
>
> As promised, here it goes.
>
> This works for me, although I'm expecting to send a test case in the
> following days.
>
> I read RFC 3987 on which iri.c is based, and it proposed a better approach
> than mine for this specific case, concretely, in section 3.2 "Converting
> URIs to IRIs". Thus, I decided to implement that approach, which basically
> says that characters in "reserved" should *not* be unescaped prior to
> converting to UTF-8.
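
For reference, the rule quoted above (do not unescape characters in "reserved" before converting to UTF-8) can be sketched roughly as below. This is only an illustration, not the code from the patch: the names unescape_except_reserved, is_reserved and hex_value are hypothetical, the reserved set is taken from RFC 3986, and only the "reserved" rule is covered. The declaration at the top shows the kind of prior prototype that the two warnings ask for; in a real tree it would normally live in a header seen by both the definition and the caller.

```c
#include <ctype.h>
#include <string.h>

/* Prior declaration: having one visible before the definition and before any
   caller is what avoids -Wmissing-prototypes and
   -Wimplicit-function-declaration. */
void unescape_except_reserved(char *s);

/* Value of one hex digit; the caller guarantees isxdigit(c). */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char) c) - 'a' + 10;
}

/* RFC 3986 "reserved" = gen-delims / sub-delims. */
static int is_reserved(int c)
{
    return c != '\0' && strchr(":/?#[]@!$&'()*+,;=", c) != NULL;
}

/* Decode %XX sequences in place, but leave a sequence untouched when it
   encodes a reserved character (so %2B stays %2B instead of becoming "+").
   Malformed escapes and %00 are also copied through unchanged. */
void unescape_except_reserved(char *s)
{
    char *out = s;

    while (*s) {
        if (s[0] == '%'
            && isxdigit((unsigned char) s[1])
            && isxdigit((unsigned char) s[2])) {
            int c = hex_value(s[1]) * 16 + hex_value(s[2]);
            if (c != 0 && !is_reserved(c)) {
                *out++ = (char) c;   /* safe to decode */
                s += 3;
                continue;
            }
        }
        *out++ = *s++;               /* copy literally */
    }
    *out = '\0';
}
```

Called on "up6%2BtTpSF" this leaves the string unchanged, so the later re-encoding step has no bare "+" to misread, while "caf%C3%A9" is still decoded to the raw UTF-8 bytes that iconv() needs.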
