Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-04-20 Thread Tim Ruehsen
Hi Ander,

sorry to answer so late. I was waiting for the mentioned test case... ;-)

Could you please fix this warning (add a prototype):
iri.c: In function 'do_conversion':
iri.c:139:3: warning: implicit declaration of function 
'url_unescape_except_reserved' [-Wimplicit-function-declaration]
   url_unescape_except_reserved (in);
   ^
url.c:217:1: warning: no previous prototype for 'url_unescape_except_reserved' 
[-Wmissing-prototypes]
 url_unescape_except_reserved (char *s)
 ^

Regards, Tim

On Monday 13 April 2015 17:03:23 Ander Juaristi wrote:
 On 04/03/2015 02:16 PM, Tim Rühsen wrote:
  Hi Ander,
  
  Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi:
  On 03/13/2015 11:48 PM, Adam Sampson wrote:
  Hi,
  
  I've just found a case where wget 1.16.3 responds to a 302 redirect
  differently depending on whether it's in an ASCII or UTF-8 locale.
  
  This works:
  LC_ALL=en_GB.UTF-8 wget
  https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
  
  This doesn't work:
  LC_ALL=C wget
  https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
  
  I've attached logs with -d showing what's actually going on. The
  
  initial request gives a 302 response with a Location: that contains:
  tar.bz2?Signature=up6%2BtTpSF...
  
  In the UTF-8 locale, wget correctly redirects to that location.
  
  In the ASCII locale, wget -d print a converted: '...' - '...' line
  
  (from iri.c's do_conversion), then redirects to:
  tar.bz2?Signature=up6+tTpSF...
  
  (If you try it yourself you'll get a slightly different URL, but at
  least for me it usually contains %2B somewhere.)
  
  This appears to be because do_conversion calls url_unescape on the
  input string it's given -- even though that input string is a _const_
  char * in the code that calls it (main - retrieve_url - url_parse -
  remote_to_utf8 - do_conversion). It's not immediately obvious to me
  whether that's intentional or not; at the very least, it's a surprising
  bit of behaviour.
  
  That call to url_unescape() is necessary because iconv() needs the
  multibyte characters with no encoding. My first approach, by the way,
  was to remove that call, but that caused Test-iri-percent.px to fail,
  which is pretty clear.
  
  The issue seems to be at the call to reencode_escapes(), just after
  remote_to_utf8() returns. The problem here is that %2B resolves to +
  (literal). And that character is equal to the reserved character +, and
  reencode_escapes() treats it as a reserved characters and leaves it
  as-is.
  The same happens with other characters, such as = (%3D).
  
  What I propose is to tag the characters that have been decoded, in
  url_unescape(), and then in reencode_escapes(), verify if they coincide
  with reserved characters as well.
  
  What do you guys think?
  
  Without looking at the code right now and from what you describe above,
  your proposal sounds like a good idea. This problem pops up again and
  again. If you solve the issue, some people will love you :-)
  
  Regards, Tim
 
 As promised, here it goes.
 
 This works to me, although I'm expecting to send a test case in the
 following days.
 
 I read RFC 3987 on which iri.c is based, and it proposed a better approach
 than mine for this specific case, concretely, in section 3.2 Converting
 URIs to IRIs. Thus, I decided to implement that approach, which basically
 says that characters in reserved should *not* be unescaped prior to
 converting to UTF-8.


signature.asc
Description: This is a digitally signed message part.


Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-04-13 Thread Ander Juaristi

On 04/03/2015 02:16 PM, Tim Rühsen wrote:

Hi Ander,

Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi:

On 03/13/2015 11:48 PM, Adam Sampson wrote:

Hi,

I've just found a case where wget 1.16.3 responds to a 302 redirect
differently depending on whether it's in an ASCII or UTF-8 locale.

This works:
LC_ALL=en_GB.UTF-8 wget
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

This doesn't work:
LC_ALL=C wget
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

I've attached logs with -d showing what's actually going on. The

initial request gives a 302 response with a Location: that contains:
tar.bz2?Signature=up6%2BtTpSF...

In the UTF-8 locale, wget correctly redirects to that location.

In the ASCII locale, wget -d print a converted: '...' - '...' line

(from iri.c's do_conversion), then redirects to:
tar.bz2?Signature=up6+tTpSF...

(If you try it yourself you'll get a slightly different URL, but at
least for me it usually contains %2B somewhere.)

This appears to be because do_conversion calls url_unescape on the
input string it's given -- even though that input string is a _const_
char * in the code that calls it (main - retrieve_url - url_parse -
remote_to_utf8 - do_conversion). It's not immediately obvious to me
whether that's intentional or not; at the very least, it's a surprising
bit of behaviour.


That call to url_unescape() is necessary because iconv() needs the multibyte
characters with no encoding. My first approach, by the way, was to remove
that call, but that caused Test-iri-percent.px to fail, which is pretty
clear.

The issue seems to be at the call to reencode_escapes(), just after
remote_to_utf8() returns. The problem here is that %2B resolves to +
(literal). And that character is equal to the reserved character +, and
reencode_escapes() treats it as a reserved characters and leaves it as-is.
The same happens with other characters, such as = (%3D).

What I propose is to tag the characters that have been decoded, in
url_unescape(), and then in reencode_escapes(), verify if they coincide
with reserved characters as well.

What do you guys think?


Without looking at the code right now and from what you describe above, your
proposal sounds like a good idea. This problem pops up again and again. If you
solve the issue, some people will love you :-)

Regards, Tim


As promised, here it goes.

This works to me, although I'm expecting to send a test case in the following 
days.

I read RFC 3987 on which iri.c is based, and it proposed a better approach than 
mine for this specific case,
concretely, in section 3.2 Converting URIs to IRIs. Thus, I decided to 
implement that approach, which basically
says that characters in reserved should *not* be unescaped prior to 
converting to UTF-8.

--
Regards,
- AJ
From 5a3f0714a14a5d9554185d72da337be471f25484 Mon Sep 17 00:00:00 2001
From: Ander Juaristi ajuari...@gmx.es
Date: Mon, 13 Apr 2015 16:28:36 +0200
Subject: [PATCH] Fixed incorrect handling of reserved characters when
 converting to UTF-8.

* src/url.c (url_unescape_1): New static function.
(url_unescape): Calls url_unescape_1 with mask zero. Preserves
same behavior as before. Only code changes.
(url_unescape_except_reserved): New function. Unescapes unsafe characters only
(leaves reserved characters in their original percent encoding).
---
 src/iri.c |  2 +-
 src/url.c | 40 +---
 2 files changed, 30 insertions(+), 12 deletions(-)

diff --git a/src/iri.c b/src/iri.c
index 5278dee..10ae994 100644
--- a/src/iri.c
+++ b/src/iri.c
@@ -136,7 +136,7 @@ do_conversion (const char *tocode, const char *fromcode, char const *in_org, siz
 
   /* iconv() has to work on an unescaped string */
   in_save = in = xstrndup (in_org, inlen);
-  url_unescape(in);
+  url_unescape_except_reserved (in);
   inlen = strlen(in);
 
   len = outlen = inlen * 2;
diff --git a/src/url.c b/src/url.c
index e78dbc6..73c8dd0 100644
--- a/src/url.c
+++ b/src/url.c
@@ -161,17 +161,8 @@ static const unsigned char urlchr_table[256] =
 #undef U
 #undef RU
 
-/* URL-unescape the string S.
-
-   This is done by transforming the sequences %HH to the character
-   represented by the hexadecimal digits HH.  If % is not followed by
-   two hexadecimal digits, it is inserted literally.
-
-   The transformation is done in place.  If you need the original
-   string intact, make a copy before calling this function.  */
-
-void
-url_unescape (char *s)
+static void
+url_unescape_1 (char *s, unsigned char mask)
 {
   char *t = s;  /* t - tortoise */
   char *h = s;  /* h - hare */
@@ -190,6 +181,8 @@ url_unescape (char *s)
   if (!h[1] || !h[2] || !(c_isxdigit (h[1])  c_isxdigit (h[2])))
 goto copychar;
   c = X2DIGITS_TO_NUM (h[1], h[2]);
+  if (urlchr_test(c, mask))
+goto copychar;
   /* Don't unescape %00 because there is no way to insert it
  into a 

Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-04-03 Thread Tim Rühsen
Hi Ander,

Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi:
 On 03/13/2015 11:48 PM, Adam Sampson wrote:
  Hi,
  
  I've just found a case where wget 1.16.3 responds to a 302 redirect
  differently depending on whether it's in an ASCII or UTF-8 locale.
  
  This works:
  LC_ALL=en_GB.UTF-8 wget
  https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
  
  This doesn't work:
  LC_ALL=C wget
  https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
  
  I've attached logs with -d showing what's actually going on. The
  
  initial request gives a 302 response with a Location: that contains:
 tar.bz2?Signature=up6%2BtTpSF...
  
  In the UTF-8 locale, wget correctly redirects to that location.
  
  In the ASCII locale, wget -d print a converted: '...' - '...' line
  
  (from iri.c's do_conversion), then redirects to:
 tar.bz2?Signature=up6+tTpSF...
  
  (If you try it yourself you'll get a slightly different URL, but at
  least for me it usually contains %2B somewhere.)
  
  This appears to be because do_conversion calls url_unescape on the
  input string it's given -- even though that input string is a _const_
  char * in the code that calls it (main - retrieve_url - url_parse -
  remote_to_utf8 - do_conversion). It's not immediately obvious to me
  whether that's intentional or not; at the very least, it's a surprising
  bit of behaviour.
 
 That call to url_unescape() is necessary because iconv() needs the multibyte
 characters with no encoding. My first approach, by the way, was to remove
 that call, but that caused Test-iri-percent.px to fail, which is pretty
 clear.
 
 The issue seems to be at the call to reencode_escapes(), just after
 remote_to_utf8() returns. The problem here is that %2B resolves to +
 (literal). And that character is equal to the reserved character +, and
 reencode_escapes() treats it as a reserved characters and leaves it as-is.
 The same happens with other characters, such as = (%3D).
 
 What I propose is to tag the characters that have been decoded, in
 url_unescape(), and then in reencode_escapes(), verify if they coincide
 with reserved characters as well.
 
 What do you guys think?

Without looking at the code right now and from what you describe above, your 
proposal sounds like a good idea. This problem pops up again and again. If you 
solve the issue, some people will love you :-)

Regards, Tim


signature.asc
Description: This is a digitally signed message part.


Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-04-03 Thread Ander Juaristi

On 03/13/2015 11:48 PM, Adam Sampson wrote:

Hi,

I've just found a case where wget 1.16.3 responds to a 302 redirect
differently depending on whether it's in an ASCII or UTF-8 locale.

This works:
LC_ALL=en_GB.UTF-8 wget 
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

This doesn't work:
LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

I've attached logs with -d showing what's actually going on. The
initial request gives a 302 response with a Location: that contains:
   tar.bz2?Signature=up6%2BtTpSF...

In the UTF-8 locale, wget correctly redirects to that location.

In the ASCII locale, wget -d print a converted: '...' - '...' line
(from iri.c's do_conversion), then redirects to:
   tar.bz2?Signature=up6+tTpSF...

(If you try it yourself you'll get a slightly different URL, but at
least for me it usually contains %2B somewhere.)

This appears to be because do_conversion calls url_unescape on the
input string it's given -- even though that input string is a _const_
char * in the code that calls it (main - retrieve_url - url_parse -
remote_to_utf8 - do_conversion). It's not immediately obvious to me
whether that's intentional or not; at the very least, it's a surprising
bit of behaviour.

That call to url_unescape() is necessary because iconv() needs the multibyte 
characters with no encoding. My first approach, by the way, was to remove that 
call, but that caused Test-iri-percent.px to fail, which is pretty clear.

The issue seems to be at the call to reencode_escapes(), just after remote_to_utf8() returns. The problem 
here is that %2B resolves to + (literal). And that character is equal to the reserved character 
+, and reencode_escapes() treats it as a reserved characters and leaves it as-is. The same 
happens with other characters, such as = (%3D).

What I propose is to tag the characters that have been decoded, in 
url_unescape(), and then in reencode_escapes(), verify if they coincide with 
reserved characters as well.

What do you guys think?

--
Regards,
- AJ



Re: [Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-04-03 Thread Ander Juaristi

On 04/03/2015 02:16 PM, Tim Rühsen wrote:

Without looking at the code right now and from what you describe above, your
proposal sounds like a good idea. This problem pops up again and again. If you
solve the issue, some people will love you :-)


Dealing back and forth with encoding is usually tricky ;-)

I'll post some patches next week.

Regards,

- AJ




[Bug-wget] Redirect containing %2B behaves differently depending on locale

2015-03-13 Thread Adam Sampson
Hi,

I've just found a case where wget 1.16.3 responds to a 302 redirect
differently depending on whether it's in an ASCII or UTF-8 locale.

This works:
LC_ALL=en_GB.UTF-8 wget 
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

This doesn't work:
LC_ALL=C wget https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2

I've attached logs with -d showing what's actually going on. The
initial request gives a 302 response with a Location: that contains:
  tar.bz2?Signature=up6%2BtTpSF...

In the UTF-8 locale, wget correctly redirects to that location.

In the ASCII locale, wget -d print a converted: '...' - '...' line
(from iri.c's do_conversion), then redirects to:
  tar.bz2?Signature=up6+tTpSF...

(If you try it yourself you'll get a slightly different URL, but at
least for me it usually contains %2B somewhere.)

This appears to be because do_conversion calls url_unescape on the
input string it's given -- even though that input string is a _const_
char * in the code that calls it (main - retrieve_url - url_parse -
remote_to_utf8 - do_conversion). It's not immediately obvious to me
whether that's intentional or not; at the very least, it's a surprising
bit of behaviour.

Thanks,

-- 
Adam Sampson a...@offog.org http://offog.org/
DEBUG output created by Wget 1.16.3 on linux-gnu.

URI encoding = 'ANSI_X3.4-1968'
converted 'https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2' 
(ANSI_X3.4-1968) - 
'https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2' (UTF-8)
--2015-03-13 22:03:25--  
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
Certificates loaded: 174
Resolving bitbucket.org (bitbucket.org)... 131.103.20.168, 131.103.20.167
Caching bitbucket.org = 131.103.20.168 131.103.20.167
Connecting to bitbucket.org (bitbucket.org)|131.103.20.168|:443... connected.
Created socket 4.
Releasing 0x7f8ec60bc5f0 (new refcount 1).

---request begin---
GET /pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2 HTTP/1.1
User-Agent: Wget/1.16.3 (linux-gnu)
Accept: */*
Accept-Encoding: identity
Host: bitbucket.org
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... 
---response begin---
HTTP/1.1 302 FOUND
Server: nginx/1.6.2
Date: Fri, 13 Mar 2015 22:03:25 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 0
Connection: keep-alive
X-Served-By: app22
X-Render-Time: 0.0194919109344
Content-Language: en
ETag: d41d8cd98f00b204e9800998ecf8427e
Location: 
https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22
Cache-Control: max-age=0
X-Request-Count: 218
Expires: Fri, 13 Mar 2015 22:03:25 GMT
Vary: Accept-Language, Cookie
X-Frame-Options: SAMEORIGIN
Last-Modified: Fri, 13 Mar 2015 22:03:25 GMT
X-Static-Version: c524695e7f84
X-Version: c524695e7f84
Strict-Transport-Security: max-age=31536000
X-Content-Type-Options: nosniff
X-Cache-Status: MISS

---response end---
302 FOUND
Registered socket 4 for persistent reuse.
URI content encoding = 'utf-8'
Location: 
https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22
 [following]
] done.
URI content encoding = None
converted 
'https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6%2BtTpSFVJ2E%2F0sD6JvfNQkeJg%3DExpires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment%3B%20filename%3D%22pypy-2.5.0-src.tar.bz2%22'
 (ANSI_X3.4-1968) - 
'https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment;
 filename=pypy-2.5.0-src.tar.bz2' (UTF-8)
--2015-03-13 22:03:25--  
https://bbuseruploads.s3.amazonaws.com/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment;%20filename=%22pypy-2.5.0-src.tar.bz2%22
Resolving bbuseruploads.s3.amazonaws.com (bbuseruploads.s3.amazonaws.com)... 
54.231.2.89
Caching bbuseruploads.s3.amazonaws.com = 54.231.2.89
Connecting to bbuseruploads.s3.amazonaws.com 
(bbuseruploads.s3.amazonaws.com)|54.231.2.89|:443... connected.
Created socket 5.
Releasing 0x7f8ec6577ea0 (new refcount 1).

---request begin---
GET 
/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2?Signature=up6+tTpSFVJ2E/0sD6JvfNQkeJg=Expires=1426285236AWSAccessKeyId=0EMWEFSGA12Z1HF1TZ82response-content-disposition=attachment;%20filename=%22pypy-2.5.0-src.tar.bz2%22
 HTTP/1.1
User-Agent: Wget/1.16.3 (linux-gnu)
Accept: */*
Accept-Encoding: