Re: [Bug-wget] [PATCH 3/3] Redirect containing %2B behaves differently depending on locale

2015-05-12 Thread Tim Rühsen
Am Mittwoch, 22. April 2015, 16:14:06 schrieb Ander Juaristi:
 On 04/22/2015 03:47 PM, Ander Juaristi wrote:
  On 04/21/2015 04:19 PM, Darshit Shah wrote:
  Regarding the patch itself, I wanted to ask if it would not be cleaner to
  dig into the code and replace every call to url_unescape with the new
  prototype? In my opinion that would help in maintaining readability and
  more importantly maintainability of the code. 
  I thought of it too, and I agree with you. The reason I haven't done it is
  because I'm not really sure whether all the functions that call
  url_unescape need the reserved characters escaped or not. I believe
  there'll be no problems, but I didn't want to just blindly replace all
  the calls to url_unescape without even having a quick look, which is
  exactly what I didn't have time to do so far. What do you guys think?
  I'll have a closer look as soon as I can (and provided no one does it
  before) and roll another patch with the replacements. Unless of course
  someone already knows the answer.
  
  Regarding the patches, I resend them with the changes made according to
  your feedback. 
  Changes made so far:
   - Merged the prototype patch into 1.
   - Shortened commit messages.
   - New test added to Makefile.am (in patch 2).
 
 Forgot to mention some files. Silly me :-(

Thanks Ander !

I pushed your patches.

Tim


signature.asc
Description: This is a digitally signed message part.


Re: [Bug-wget] [PATCH 3/3] Redirect containing %2B behaves differently depending on locale

2015-04-22 Thread Ander Juaristi

On 04/21/2015 04:19 PM, Darshit Shah wrote:

Regarding the patch itself, I wanted to ask if it would not be cleaner to dig 
into the code and replace every call to url_unescape with the new prototype? In 
my opinion that would help in maintaining readability and more importantly 
maintainability of the code.


I thought of it too, and I agree with you. The reason I haven't done it is 
because I'm not really sure whether all the functions that call url_unescape 
need the reserved characters escaped or not. I believe there'll be no problems, 
but I didn't want to just blindly replace all the calls to url_unescape without 
even having a quick look, which is exactly what I didn't have time to do so 
far. What do you guys think?
I'll have a closer look as soon as I can (and provided no one does it before) 
and roll another patch with the replacements. Unless of course someone already 
knows the answer.

Regarding the patches, I resend them with the changes made according to your 
feedback.
Changes made so far:
- Merged the prototype patch into 1.
- Shortened commit messages.
- New test added to Makefile.am (in patch 2).

--
Regards,
- AJ
From 0123e79976aa2ec35f9610fbf3094457b689d091 Mon Sep 17 00:00:00 2001
From: Ander Juaristi ajuari...@gmx.es
Date: Mon, 13 Apr 2015 16:28:36 +0200
Subject: [PATCH 1/2] Fixed incorrect handling of reserved chars.

* src/url.c (url_unescape_1): New static function.
(url_unescape): Calls url_unescape_1 with mask zero. Preserves
same behavior as before. Only code changes.
(url_unescape_except_reserved): New function.

* src/url.h: Added prototype for url_unescape_except_reserved().

When the locale is US-ASCII, URIs that contain special characters
in them are converted to IRIs according to RFC 3987, section 3.2
Converting URIs to IRIs.
---
 src/iri.c |  2 +-
 src/url.c | 40 +---
 src/url.h |  1 +
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/src/iri.c b/src/iri.c
index 5278dee..10ae994 100644
--- a/src/iri.c
+++ b/src/iri.c
@@ -136,7 +136,7 @@ do_conversion (const char *tocode, const char *fromcode, char const *in_org, siz
 
   /* iconv() has to work on an unescaped string */
   in_save = in = xstrndup (in_org, inlen);
-  url_unescape(in);
+  url_unescape_except_reserved (in);
   inlen = strlen(in);
 
   len = outlen = inlen * 2;
diff --git a/src/url.c b/src/url.c
index e78dbc6..73c8dd0 100644
--- a/src/url.c
+++ b/src/url.c
@@ -161,17 +161,8 @@ static const unsigned char urlchr_table[256] =
 #undef U
 #undef RU
 
-/* URL-unescape the string S.
-
-   This is done by transforming the sequences %HH to the character
-   represented by the hexadecimal digits HH.  If % is not followed by
-   two hexadecimal digits, it is inserted literally.
-
-   The transformation is done in place.  If you need the original
-   string intact, make a copy before calling this function.  */
-
-void
-url_unescape (char *s)
+static void
+url_unescape_1 (char *s, unsigned char mask)
 {
   char *t = s;  /* t - tortoise */
   char *h = s;  /* h - hare */
@@ -190,6 +181,8 @@ url_unescape (char *s)
   if (!h[1] || !h[2] || !(c_isxdigit (h[1])  c_isxdigit (h[2])))
 goto copychar;
   c = X2DIGITS_TO_NUM (h[1], h[2]);
+  if (urlchr_test(c, mask))
+goto copychar;
   /* Don't unescape %00 because there is no way to insert it
  into a C string without effectively truncating it. */
   if (c == '\0')
@@ -201,6 +194,31 @@ url_unescape (char *s)
   *t = '\0';
 }
 
+/* URL-unescape the string S.
+
+   This is done by transforming the sequences %HH to the character
+   represented by the hexadecimal digits HH.  If % is not followed by
+   two hexadecimal digits, it is inserted literally.
+
+   The transformation is done in place.  If you need the original
+   string intact, make a copy before calling this function.  */
+void
+url_unescape (char *s)
+{
+  url_unescape_1 (s, 0);
+}
+
+/* URL-unescape the string S.
+
+   This functions behaves identically as url_unescape(), but does not
+   convert characters from reserved. In other words, it only converts
+   unsafe characters.  */
+void
+url_unescape_except_reserved (char *s)
+{
+  url_unescape_1 (s, urlchr_reserved);
+}
+
 /* The core of url_escape_* functions.  Escapes the characters that
match the provided mask in urlchr_table.
 
diff --git a/src/url.h b/src/url.h
index b1c46c1..a543f3d 100644
--- a/src/url.h
+++ b/src/url.h
@@ -106,6 +106,7 @@ struct url
 char *url_escape (const char *);
 char *url_escape_unsafe_and_reserved (const char *);
 void url_unescape (char *);
+void url_unescape_except_reserved (char *);
 
 struct url *url_parse (const char *, int *, struct iri *iri, bool percent_encode);
 char *url_error (const char *, int);
-- 
1.9.1

From 2e295c70bd89f2d40380404ae7a2745a7b0c6075 Mon Sep 17 00:00:00 2001
From: Ander Juaristi ajuari...@gmx.es
Date: Mon, 20 Apr 2015 23:16:18 

Re: [Bug-wget] [PATCH 3/3] Redirect containing %2B behaves differently depending on locale

2015-04-22 Thread Ander Juaristi

On 04/22/2015 03:47 PM, Ander Juaristi wrote:

On 04/21/2015 04:19 PM, Darshit Shah wrote:

Regarding the patch itself, I wanted to ask if it would not be cleaner to dig 
into the code and replace every call to url_unescape with the new prototype? In 
my opinion that would help in maintaining readability and more importantly 
maintainability of the code.


I thought of it too, and I agree with you. The reason I haven't done it is 
because I'm not really sure whether all the functions that call url_unescape 
need the reserved characters escaped or not. I believe there'll be no problems, 
but I didn't want to just blindly replace all the calls to url_unescape without 
even having a quick look, which is exactly what I didn't have time to do so 
far. What do you guys think?
I'll have a closer look as soon as I can (and provided no one does it before) 
and roll another patch with the replacements. Unless of course someone already 
knows the answer.

Regarding the patches, I resend them with the changes made according to your 
feedback.
Changes made so far:
 - Merged the prototype patch into 1.
 - Shortened commit messages.
 - New test added to Makefile.am (in patch 2).



Forgot to mention some files. Silly me :-(

--
Regards,
- AJ
From 0123e79976aa2ec35f9610fbf3094457b689d091 Mon Sep 17 00:00:00 2001
From: Ander Juaristi ajuari...@gmx.es
Date: Mon, 13 Apr 2015 16:28:36 +0200
Subject: [PATCH 1/2] Fixed incorrect handling of reserved chars.

* src/iri.c (do_conversion): Call url_unescape_except_reserved,
instead of url_unescape.

* src/url.c (url_unescape_1): New static function.
(url_unescape): Calls url_unescape_1 with mask zero. Preserves
same behavior as before. Only code changes.
(url_unescape_except_reserved): New function.

* src/url.h: Added prototype for url_unescape_except_reserved().

When the locale is US-ASCII, URIs that contain special characters
in them are converted to IRIs according to RFC 3987, section 3.2
Converting URIs to IRIs.
---
 src/iri.c |  2 +-
 src/url.c | 40 +---
 src/url.h |  1 +
 3 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/src/iri.c b/src/iri.c
index 5278dee..10ae994 100644
--- a/src/iri.c
+++ b/src/iri.c
@@ -136,7 +136,7 @@ do_conversion (const char *tocode, const char *fromcode, char const *in_org, siz
 
   /* iconv() has to work on an unescaped string */
   in_save = in = xstrndup (in_org, inlen);
-  url_unescape(in);
+  url_unescape_except_reserved (in);
   inlen = strlen(in);
 
   len = outlen = inlen * 2;
diff --git a/src/url.c b/src/url.c
index e78dbc6..73c8dd0 100644
--- a/src/url.c
+++ b/src/url.c
@@ -161,17 +161,8 @@ static const unsigned char urlchr_table[256] =
 #undef U
 #undef RU
 
-/* URL-unescape the string S.
-
-   This is done by transforming the sequences %HH to the character
-   represented by the hexadecimal digits HH.  If % is not followed by
-   two hexadecimal digits, it is inserted literally.
-
-   The transformation is done in place.  If you need the original
-   string intact, make a copy before calling this function.  */
-
-void
-url_unescape (char *s)
+static void
+url_unescape_1 (char *s, unsigned char mask)
 {
   char *t = s;  /* t - tortoise */
   char *h = s;  /* h - hare */
@@ -190,6 +181,8 @@ url_unescape (char *s)
   if (!h[1] || !h[2] || !(c_isxdigit (h[1])  c_isxdigit (h[2])))
 goto copychar;
   c = X2DIGITS_TO_NUM (h[1], h[2]);
+  if (urlchr_test(c, mask))
+goto copychar;
   /* Don't unescape %00 because there is no way to insert it
  into a C string without effectively truncating it. */
   if (c == '\0')
@@ -201,6 +194,31 @@ url_unescape (char *s)
   *t = '\0';
 }
 
+/* URL-unescape the string S.
+
+   This is done by transforming the sequences %HH to the character
+   represented by the hexadecimal digits HH.  If % is not followed by
+   two hexadecimal digits, it is inserted literally.
+
+   The transformation is done in place.  If you need the original
+   string intact, make a copy before calling this function.  */
+void
+url_unescape (char *s)
+{
+  url_unescape_1 (s, 0);
+}
+
+/* URL-unescape the string S.
+
+   This functions behaves identically as url_unescape(), but does not
+   convert characters from reserved. In other words, it only converts
+   unsafe characters.  */
+void
+url_unescape_except_reserved (char *s)
+{
+  url_unescape_1 (s, urlchr_reserved);
+}
+
 /* The core of url_escape_* functions.  Escapes the characters that
match the provided mask in urlchr_table.
 
diff --git a/src/url.h b/src/url.h
index b1c46c1..a543f3d 100644
--- a/src/url.h
+++ b/src/url.h
@@ -106,6 +106,7 @@ struct url
 char *url_escape (const char *);
 char *url_escape_unsafe_and_reserved (const char *);
 void url_unescape (char *);
+void url_unescape_except_reserved (char *);
 
 struct url *url_parse (const char *, int *, struct iri *iri, bool percent_encode);
 char