Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
Hi Ander, sorry to answer so late. I was waiting for the mentioned test case... ;-) Could you please fix this warning (add a prototype): iri.c: In function 'do_conversion': iri.c:139:3: warning: implicit declaration of function 'url_unescape_except_reserved' [-Wimplicit-function-declaration] url_unescape_except_reserved (in); ^ url.c:217:1: warning: no previous prototype for 'url_unescape_except_reserved' [-Wmissing-prototypes] url_unescape_except_reserved (char *s) ^ Regards, Tim On Monday 13 April 2015 17:03:23 Ander Juaristi wrote: On 04/03/2015 02:16 PM, Tim Rühsen wrote: Hi Ander, Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi: On 03/13/2015 11:48 PM, Adam Sampson wrote: Hi, I've just found a case where wget 1.16.3 responds to a 302 redirect differently depending on whether it's in an ASCII or UTF-8 locale. This works: LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 This doesn't work: LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 I've attached logs with -d showing what's actually going on. The initial request gives a 302 response with a Location: that contains: tar.bz2?Signature=up6%2BtTpSF... In the UTF-8 locale, wget correctly redirects to that location. In the ASCII locale, wget -d print a converted: '...' - '...' line (from iri.c's do_conversion), then redirects to: tar.bz2?Signature=up6+tTpSF... (If you try it yourself you'll get a slightly different URL, but at least for me it usually contains %2B somewhere.) This appears to be because do_conversion calls url_unescape on the input string it's given -- even though that input string is a _const_ char * in the code that calls it (main - retrieve_url - url_parse - remote_to_utf8 - do_conversion). It's not immediately obvious to me whether that's intentional or not; at the very least, it's a surprising bit of behaviour. That call to url_unescape() is necessary because iconv() needs the multibyte characters with no encoding. My first approach, by the way, was to remove that call, but that caused Test-iri-percent.px to fail, which is pretty clear. The issue seems to be at the call to reencode_escapes(), just after remote_to_utf8() returns. The problem here is that %2B resolves to + (literal). And that character is equal to the reserved character +, and reencode_escapes() treats it as a reserved characters and leaves it as-is. The same happens with other characters, such as = (%3D). What I propose is to tag the characters that have been decoded, in url_unescape(), and then in reencode_escapes(), verify if they coincide with reserved characters as well. What do you guys think? Without looking at the code right now and from what you describe above, your proposal sounds like a good idea. This problem pops up again and again. If you solve the issue, some people will love you :-) Regards, Tim As promised, here it goes. This works to me, although I'm expecting to send a test case in the following days. I read RFC 3987 on which iri.c is based, and it proposed a better approach than mine for this specific case, concretely, in section 3.2 Converting URIs to IRIs. Thus, I decided to implement that approach, which basically says that characters in reserved should *not* be unescaped prior to converting to UTF-8. signature.asc Description: This is a digitally signed message part.
Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
On 04/03/2015 02:16 PM, Tim Rühsen wrote: Hi Ander, Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi: On 03/13/2015 11:48 PM, Adam Sampson wrote: Hi, I've just found a case where wget 1.16.3 responds to a 302 redirect differently depending on whether it's in an ASCII or UTF-8 locale. This works: LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 This doesn't work: LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 I've attached logs with -d showing what's actually going on. The initial request gives a 302 response with a Location: that contains: tar.bz2?Signature=up6%2BtTpSF... In the UTF-8 locale, wget correctly redirects to that location. In the ASCII locale, wget -d print a converted: '...' - '...' line (from iri.c's do_conversion), then redirects to: tar.bz2?Signature=up6+tTpSF... (If you try it yourself you'll get a slightly different URL, but at least for me it usually contains %2B somewhere.) This appears to be because do_conversion calls url_unescape on the input string it's given -- even though that input string is a _const_ char * in the code that calls it (main - retrieve_url - url_parse - remote_to_utf8 - do_conversion). It's not immediately obvious to me whether that's intentional or not; at the very least, it's a surprising bit of behaviour. That call to url_unescape() is necessary because iconv() needs the multibyte characters with no encoding. My first approach, by the way, was to remove that call, but that caused Test-iri-percent.px to fail, which is pretty clear. The issue seems to be at the call to reencode_escapes(), just after remote_to_utf8() returns. The problem here is that %2B resolves to + (literal). And that character is equal to the reserved character +, and reencode_escapes() treats it as a reserved characters and leaves it as-is. The same happens with other characters, such as = (%3D). What I propose is to tag the characters that have been decoded, in url_unescape(), and then in reencode_escapes(), verify if they coincide with reserved characters as well. What do you guys think? Without looking at the code right now and from what you describe above, your proposal sounds like a good idea. This problem pops up again and again. If you solve the issue, some people will love you :-) Regards, Tim As promised, here it goes. This works to me, although I'm expecting to send a test case in the following days. I read RFC 3987 on which iri.c is based, and it proposed a better approach than mine for this specific case, concretely, in section 3.2 Converting URIs to IRIs. Thus, I decided to implement that approach, which basically says that characters in reserved should *not* be unescaped prior to converting to UTF-8. -- Regards, - AJ From 5a3f0714a14a5d9554185d72da337be471f25484 Mon Sep 17 00:00:00 2001 From: Ander Juaristi ajuari...@gmx.es Date: Mon, 13 Apr 2015 16:28:36 +0200 Subject: [PATCH] Fixed incorrect handling of reserved characters when converting to UTF-8. * src/url.c (url_unescape_1): New static function. (url_unescape): Calls url_unescape_1 with mask zero. Preserves same behavior as before. Only code changes. (url_unescape_except_reserved): New function. Unescapes unsafe characters only (leaves reserved characters in their original percent encoding). --- src/iri.c | 2 +- src/url.c | 40 +--- 2 files changed, 30 insertions(+), 12 deletions(-) diff --git a/src/iri.c b/src/iri.c index 5278dee..10ae994 100644 --- a/src/iri.c +++ b/src/iri.c @@ -136,7 +136,7 @@ do_conversion (const char *tocode, const char *fromcode, char const *in_org, siz /* iconv() has to work on an unescaped string */ in_save = in = xstrndup (in_org, inlen); - url_unescape(in); + url_unescape_except_reserved (in); inlen = strlen(in); len = outlen = inlen * 2; diff --git a/src/url.c b/src/url.c index e78dbc6..73c8dd0 100644 --- a/src/url.c +++ b/src/url.c @@ -161,17 +161,8 @@ static const unsigned char urlchr_table[256] = #undef U #undef RU -/* URL-unescape the string S. - - This is done by transforming the sequences %HH to the character - represented by the hexadecimal digits HH. If % is not followed by - two hexadecimal digits, it is inserted literally. - - The transformation is done in place. If you need the original - string intact, make a copy before calling this function. */ - -void -url_unescape (char *s) +static void +url_unescape_1 (char *s, unsigned char mask) { char *t = s; /* t - tortoise */ char *h = s; /* h - hare */ @@ -190,6 +181,8 @@ url_unescape (char *s) if (!h[1] || !h[2] || !(c_isxdigit (h[1]) c_isxdigit (h[2]))) goto copychar; c = X2DIGITS_TO_NUM (h[1], h[2]); + if (urlchr_test(c, mask)) +goto copychar; /* Don't unescape %00 because there is no way to insert it into a
Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
Hi Ander, Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi: On 03/13/2015 11:48 PM, Adam Sampson wrote: Hi, I've just found a case where wget 1.16.3 responds to a 302 redirect differently depending on whether it's in an ASCII or UTF-8 locale. This works: LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 This doesn't work: LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 I've attached logs with -d showing what's actually going on. The initial request gives a 302 response with a Location: that contains: tar.bz2?Signature=up6%2BtTpSF... In the UTF-8 locale, wget correctly redirects to that location. In the ASCII locale, wget -d print a converted: '...' - '...' line (from iri.c's do_conversion), then redirects to: tar.bz2?Signature=up6+tTpSF... (If you try it yourself you'll get a slightly different URL, but at least for me it usually contains %2B somewhere.) This appears to be because do_conversion calls url_unescape on the input string it's given -- even though that input string is a _const_ char * in the code that calls it (main - retrieve_url - url_parse - remote_to_utf8 - do_conversion). It's not immediately obvious to me whether that's intentional or not; at the very least, it's a surprising bit of behaviour. That call to url_unescape() is necessary because iconv() needs the multibyte characters with no encoding. My first approach, by the way, was to remove that call, but that caused Test-iri-percent.px to fail, which is pretty clear. The issue seems to be at the call to reencode_escapes(), just after remote_to_utf8() returns. The problem here is that %2B resolves to + (literal). And that character is equal to the reserved character +, and reencode_escapes() treats it as a reserved characters and leaves it as-is. The same happens with other characters, such as = (%3D). What I propose is to tag the characters that have been decoded, in url_unescape(), and then in reencode_escapes(), verify if they coincide with reserved characters as well. What do you guys think? Without looking at the code right now and from what you describe above, your proposal sounds like a good idea. This problem pops up again and again. If you solve the issue, some people will love you :-) Regards, Tim signature.asc Description: This is a digitally signed message part.
Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
On 03/13/2015 11:48 PM, Adam Sampson wrote: Hi, I've just found a case where wget 1.16.3 responds to a 302 redirect differently depending on whether it's in an ASCII or UTF-8 locale. This works: LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 This doesn't work: LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 I've attached logs with -d showing what's actually going on. The initial request gives a 302 response with a Location: that contains: tar.bz2?Signature=up6%2BtTpSF... In the UTF-8 locale, wget correctly redirects to that location. In the ASCII locale, wget -d print a converted: '...' - '...' line (from iri.c's do_conversion), then redirects to: tar.bz2?Signature=up6+tTpSF... (If you try it yourself you'll get a slightly different URL, but at least for me it usually contains %2B somewhere.) This appears to be because do_conversion calls url_unescape on the input string it's given -- even though that input string is a _const_ char * in the code that calls it (main - retrieve_url - url_parse - remote_to_utf8 - do_conversion). It's not immediately obvious to me whether that's intentional or not; at the very least, it's a surprising bit of behaviour. That call to url_unescape() is necessary because iconv() needs the multibyte characters with no encoding. My first approach, by the way, was to remove that call, but that caused Test-iri-percent.px to fail, which is pretty clear. The issue seems to be at the call to reencode_escapes(), just after remote_to_utf8() returns. The problem here is that %2B resolves to + (literal). And that character is equal to the reserved character +, and reencode_escapes() treats it as a reserved characters and leaves it as-is. The same happens with other characters, such as = (%3D). What I propose is to tag the characters that have been decoded, in url_unescape(), and then in reencode_escapes(), verify if they coincide with reserved characters as well. What do you guys think? -- Regards, - AJ
Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale
On 04/03/2015 02:16 PM, Tim Rühsen wrote: Without looking at the code right now and from what you describe above, your proposal sounds like a good idea. This problem pops up again and again. If you solve the issue, some people will love you :-) Dealing back and forth with encoding is usually tricky ;-) I'll post some patches next week. Regards, - AJ
[Bug-wget] Redirect containing %2B behaves differently depending on locale
Hi, I've just found a case where wget 1.16.3 responds to a 302 redirect differently depending on whether it's in an ASCII or UTF-8 locale. This works: LC_ALL=en_GB.UTF-8 wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 This doesn't work: LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 I've attached logs with -d showing what's actually going on. The initial request gives a 302 response with a Location: that contains: tar.bz2?Signature=up6%2BtTpSF... In the UTF-8 locale, wget correctly redirects to that location. In the ASCII locale, wget -d print a converted: '...' - '...' line (from iri.c's do_conversion), then redirects to: tar.bz2?Signature=up6+tTpSF... (If you try it yourself you'll get a slightly different URL, but at least for me it usually contains %2B somewhere.) This appears to be because do_conversion calls url_unescape on the input string it's given -- even though that input string is a _const_ char * in the code that calls it (main - retrieve_url - url_parse - remote_to_utf8 - do_conversion). It's not immediately obvious to me whether that's intentional or not; at the very least, it's a surprising bit of behaviour. Thanks, -- Adam Sampson a...@offog.org http://offog.org/ DEBUG output created by Wget 1.16.3 on linux-gnu. URI encoding = 'ANSI_X3.4-1968' converted 'https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2' (ANSI_X3.4-1968) - 'https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2' (UTF-8) --2015-03-13 22:03:25-- https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 Certificates loaded: 174 Resolving bitbucket.org (bitbucket.org)... 131.103.20.168, 131.103.20.167 Caching bitbucket.org = 131.103.20.168 131.103.20.167 Connecting to bitbucket.org (bitbucket.org)|131.103.20.168|:443... connected. Created socket 4. Releasing 0x7f8ec60bc5f0 (new refcount 1). ---request begin--- GET /pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 HTTP/1.1 User-Agent: Wget/1.16.3 (linux-gnu) Accept: */* Accept-Encoding: identity Host: bitbucket.org Connection: Keep-Alive ---request end--- HTTP request sent, awaiting response... ---response begin--- HTTP/1.1 302 FOUND Server: nginx/1.6.2 Date: Fri, 13 Mar 2015 22:03:25 GMT Content-Type: text/html; charset=utf-8 Content-Length: 0 Connection: keep-alive X-Served-By: app22 X-Render-Time: 0.0194919109344 Content-Language: en ETag: d41d8cd98f00b204e9800998ecf8427e Location: https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22 Cache-Control: max-age=0 X-Request-Count: 218 Expires: Fri, 13 Mar 2015 22:03:25 GMT Vary: Accept-Language, Cookie X-Frame-Options: SAMEORIGIN Last-Modified: Fri, 13 Mar 2015 22:03:25 GMT X-Static-Version: c524695e7f84 X-Version: c524695e7f84 Strict-Transport-Security: max-age=31536000 X-Content-Type-Options: nosniff X-Cache-Status: MISS ---response end--- 302 FOUND Registered socket 4 for persistent reuse. URI content encoding = 'utf-8' Location: https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22 [following] ] done. URI content encoding = None converted 'https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22' (ANSI_X3.4-1968) - 'https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment; filename=pypy-2.5.0-src.tar.bz2' (UTF-8) --2015-03-13 22:03:25-- https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment;%20filename=%22pypy-2.5.0-src.tar.bz2%22 Resolving bbuseruploads.s3.amazonaws.com (bbuseruploads.s3.amazonaws.com)... 54.231.2.89 Caching bbuseruploads.s3.amazonaws.com = 54.231.2.89 Connecting to bbuseruploads.s3.amazonaws.com (bbuseruploads.s3.amazonaws.com)|54.231.2.89|:443... connected. Created socket 5. Releasing 0x7f8ec6577ea0 (new refcount 1). ---request begin--- GET /pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment;%20filename=%22pypy-2.5.0-src.tar.bz2%22 HTTP/1.1 User-Agent: Wget/1.16.3 (linux-gnu) Accept: */* Accept-Encoding: