On 04/03/2015 02:16 PM, Tim Rühsen wrote:
Hi Ander,
Am Freitag, 3. April 2015, 12:26:09 schrieb Ander Juaristi:
On 03/13/2015 11:48 PM, Adam Sampson wrote:
Hi,
I've just found a case where wget 1.16.3 responds to a 302 redirect
differently depending on whether it's in an ASCII or UTF-8 locale.
This works:
LC_ALL=en_GB.UTF-8 wget
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
This doesn't work:
LC_ALL=C wget
https://bitbucket.org/pypy/pypy/downloads/pypy-2.5.0-src.tar.bz2
I've attached logs with -d showing what's actually going on. The
initial request gives a 302 response with a Location: that contains:
....tar.bz2?Signature=up6%2BtTpSF...
In the UTF-8 locale, wget correctly redirects to that location.
In the ASCII locale, wget -d print a "converted: '...' -> '...'" line
(from iri.c's do_conversion), then redirects to:
....tar.bz2?Signature=up6+tTpSF...
(If you try it yourself you'll get a slightly different URL, but at
least for me it usually contains %2B somewhere.)
This appears to be because do_conversion calls url_unescape on the
input string it's given -- even though that input string is a _const_
char * in the code that calls it (main -> retrieve_url -> url_parse ->
remote_to_utf8 -> do_conversion). It's not immediately obvious to me
whether that's intentional or not; at the very least, it's a surprising
bit of behaviour.
That call to url_unescape() is necessary because iconv() needs the multibyte
characters with no encoding. My first approach, by the way, was to remove
that call, but that caused Test-iri-percent.px to fail, which is pretty
clear.
The issue seems to be at the call to reencode_escapes(), just after
remote_to_utf8() returns. The problem here is that %2B resolves to "+"
(literal). And that character is equal to the reserved character "+", and
reencode_escapes() treats it as a reserved characters and leaves it as-is.
The same happens with other characters, such as "=" (%3D).
What I propose is to tag the characters that have been decoded, in
url_unescape(), and then in reencode_escapes(), verify if they coincide
with reserved characters as well.
What do you guys think?
Without looking at the code right now and from what you describe above, your
proposal sounds like a good idea. This problem pops up again and again. If you
solve the issue, some people will love you :-)
Regards, Tim
As promised, here it goes.
This works to me, although I'm expecting to send a test case in the following
days.
I read RFC 3987 on which iri.c is based, and it proposed a better approach than
mine for this specific case,
concretely, in section 3.2 "Converting URIs to IRIs". Thus, I decided to
implement that approach, which basically
says that characters in "reserved" should *not* be unescaped prior to
converting to UTF-8.
--
Regards,
- AJ
>From 5a3f0714a14a5d9554185d72da337be471f25484 Mon Sep 17 00:00:00 2001
From: Ander Juaristi <[email protected]>
Date: Mon, 13 Apr 2015 16:28:36 +0200
Subject: [PATCH] Fixed incorrect handling of reserved characters when
converting to UTF-8.
* src/url.c (url_unescape_1): New static function.
(url_unescape): Calls url_unescape_1 with mask zero. Preserves
same behavior as before. Only code changes.
(url_unescape_except_reserved): New function. Unescapes unsafe characters only
(leaves reserved characters in their original percent encoding).
---
src/iri.c | 2 +-
src/url.c | 40 +++++++++++++++++++++++++++++-----------
2 files changed, 30 insertions(+), 12 deletions(-)
diff --git a/src/iri.c b/src/iri.c
index 5278dee..10ae994 100644
--- a/src/iri.c
+++ b/src/iri.c
@@ -136,7 +136,7 @@ do_conversion (const char *tocode, const char *fromcode, char const *in_org, siz
/* iconv() has to work on an unescaped string */
in_save = in = xstrndup (in_org, inlen);
- url_unescape(in);
+ url_unescape_except_reserved (in);
inlen = strlen(in);
len = outlen = inlen * 2;
diff --git a/src/url.c b/src/url.c
index e78dbc6..73c8dd0 100644
--- a/src/url.c
+++ b/src/url.c
@@ -161,17 +161,8 @@ static const unsigned char urlchr_table[256] =
#undef U
#undef RU
-/* URL-unescape the string S.
-
- This is done by transforming the sequences "%HH" to the character
- represented by the hexadecimal digits HH. If % is not followed by
- two hexadecimal digits, it is inserted literally.
-
- The transformation is done in place. If you need the original
- string intact, make a copy before calling this function. */
-
-void
-url_unescape (char *s)
+static void
+url_unescape_1 (char *s, unsigned char mask)
{
char *t = s; /* t - tortoise */
char *h = s; /* h - hare */
@@ -190,6 +181,8 @@ url_unescape (char *s)
if (!h[1] || !h[2] || !(c_isxdigit (h[1]) && c_isxdigit (h[2])))
goto copychar;
c = X2DIGITS_TO_NUM (h[1], h[2]);
+ if (urlchr_test(c, mask))
+ goto copychar;
/* Don't unescape %00 because there is no way to insert it
into a C string without effectively truncating it. */
if (c == '\0')
@@ -201,6 +194,31 @@ url_unescape (char *s)
*t = '\0';
}
+/* URL-unescape the string S.
+
+ This is done by transforming the sequences "%HH" to the character
+ represented by the hexadecimal digits HH. If % is not followed by
+ two hexadecimal digits, it is inserted literally.
+
+ The transformation is done in place. If you need the original
+ string intact, make a copy before calling this function. */
+void
+url_unescape (char *s)
+{
+ url_unescape_1 (s, 0);
+}
+
+/* URL-unescape the string S.
+
+ This functions behaves identically as url_unescape(), but does not
+ convert characters from "reserved". In other words, it only converts
+ "unsafe" characters. */
+void
+url_unescape_except_reserved (char *s)
+{
+ url_unescape_1 (s, urlchr_reserved);
+}
+
/* The core of url_escape_* functions. Escapes the characters that
match the provided mask in urlchr_table.
--
1.9.1