Re: [Bug-wget] Submitting a patch fixing the --content-disposition option of wget

Vladimír Pýcha Fri, 14 Feb 2014 06:59:46 -0800

Ok, attached is the new patch file with all edits mentioned in ChangeLog.

Cheers,
Vlad



On Fri, Feb 14, 2014 at 1:12 PM, Darshit Shah <[email protected]> wrote:

> That seems like a much cleaner patch thanks!!
>
> We usually mention every edit in the ChangeLog. Those missing are trivial.
> It would be great if you could those too. Specifically, cookies.c, and
> other files which you have changed, even of to change the function header.
>
> Else, Giuseppe while applying it may do it.
>
>
> On Fri, Feb 14, 2014 at 12:59 PM, Vladimír Pýcha <[email protected]> wrote:
>
>> Hi Darshit,
>> Thanks for your suggestions, I followed them. Please find attached the
>> improved patch, created with `git format-patch -1`.
>>
>> Cheers,
>> Vlad
>>
>>
>>
>> On Wed, Feb 12, 2014 at 4:33 PM, Darshit Shah <[email protected]> wrote:
>>
>>>
>>> On Wed, Feb 12, 2014 at 9:48 AM, Vladimír Pýcha <[email protected]>wrote:
>>>
>>>> Hello,
>>>> I have created a small patch for wget 1.15. It is related to the
>>>> experimental --content-disposition option. The patch is attached.
>>>>
>>>> This patch ensures that if the filename parameter of Content-Disposition
>>>> header is url-encoded as described in RFC 2231, then wget decodes it
>>>> (the
>>>> character set is ignored).
>>>>
>>>> Besides using the --content-disposition option, there may be also
>>>> needed to
>>>> use the --restrict-file-names=nocontrol option to avoid wget from
>>>> escaping
>>>> some characters of the filename.
>>>>
>>>> I have created a testing URL:
>>>> http://www.vladpride.cz/res/content-disposition-with-unicode.php
>>>>
>>>> The source code of the PHP script at that URL is as follows:
>>>>
>>>> <?php
>>>> $str = 'aácč dďsš zžrř iínň';
>>>> header('Content-Type: text/plain; charset=UTF-8');
>>>> header("Content-Disposition: attachment;
>>>> filename*=UTF-8''".rawurlencode($str.'.txt'));
>>>> echo $str;
>>>> ?>
>>>>
>>>> You can use the following command to test it:
>>>>
>>>> wget --content-disposition --restrict-file-names=nocontrol "
>>>> http://www.vladpride.cz/res/content-disposition-with-unicode.php";
>>>>
>>>> Using the above command, my patched version of wget correctly saves the
>>>> file as:
>>>>
>>>> aácč dďsš zžrř iínň.txt
>>>>
>>>> Using the same command with unpatched wget, the file is incorrectly
>>>> saved
>>>> as:
>>>>
>>>> a%C3%A1c%C4%8D%20d%C4%8Fs%C5%A1%20z%C5%BEr%C5%99%20i%C3%ADn%C5%88.txt
>>>>
>>>> It would be great if my patch could be incorporated into the next
>>>> release
>>>> of wget.
>>>>
>>>> I have created the patch file with the following command:
>>>>
>>>> diff -rupN wget-1.15/src/ wget-1.15-custom/src/ >
>>>> content-disposition.patch
>>>>
>>>> You can apply the patch with the following command, while in the
>>>> directory
>>>> where the source code tarball was extracted to:
>>>>
>>>> patch -p1 < ../content-disposition.patch
>>>>
>>>> Then the output will be like this:
>>>>
>>>> patching file src/http.c
>>>> patching file src/http.h
>>>> patching file src/url.c
>>>> patching file src/url.h
>>>>
>>>> Note that in my system, Ubuntu 12.04, I had to install package
>>>> libgnutls-dev to be able to compile wget.
>>>>
>>>> Cheers,
>>>> Vlad
>>>>
>>>
>>> Hi,
>>>
>>> Thanks for your patch! However, it would be great if you could add an
>>> entry to the ChangeLog in the git tree and recreate the patch file. Also,
>>> it would be easier to simply create the patch using `git format-patch -1`
>>>
>>> You can find the git repository for Wget at:
>>> http://git.savannah.gnu.org/cgit/wget.git
>>>
>>> Some comments about the code inline. (Gmail sometimes mangles lines. So
>>> I'm sorry if this looks ugly)
>>>
>>> iff -rupN wget-1.15/src/http.c wget-1.15-custom/src/http.c
>>> --- wget-1.15/src/http.c        2014-01-07 15:58:49.000000000 +0100
>>> +++ wget-1.15-custom/src/http.c 2014-02-12 08:27:28.713919909 +0100
>>> @@ -1066,7 +1066,20 @@ bool
>>>  extract_param (const char **source, param_token *name, param_token
>>> *value,
>>>                 char separator)
>>>  {
>>> +  return extract_param_new (source, name, value, separator, (bool *) 0);
>>> +}
>>> +
>>> +/* Like extract_param, but with the addition of parameter
>>> is_url_encoded that is set to true if the value is url-encoded (see RFC
>>> 2231 for details). */
>>>
>>> Your comment exceeds the 80 characters per line convention. It would be
>>> great if you could fix this.
>>>
>>> extract_param_new is *not* like extract_param. Because extract_param
>>> simply calls extract_param_new. It is like the old version of
>>> extract_param, but a new reader does not know that. Hence, simply move the
>>> old comment to your new function.
>>>
>>> +
>>> +bool
>>> +extract_param_new (const char **source, param_token *name, param_token
>>> *value,
>>> +               char separator, bool *is_url_encoded)
>>>
>>> I am not sure why you need to define extract_param_new. Why is not
>>> editing the former extract_param() method a better idea? Sure, you would be
>>> required to edit all invokations of the method in the source code, but such
>>> refactoring isn't difficult.
>>>
>>> +{
>>>    const char *p = *source;
>>> +  if (is_url_encoded)
>>> +    {
>>> +      *is_url_encoded = false;
>>> +    }
>>>
>>> Am I correct in interpreting that you are simply forcing the value of
>>> is_url_encoded to be false? Why have a parameter in that case? I might be
>>> wrong, but please explain this.
>>>
>>>    while (c_isspace (*p)) ++p;
>>>    if (!*p)
>>> @@ -1125,6 +1138,9 @@ extract_param (const char **source, para
>>>    int param_type = modify_param_name(name);
>>>    if (NOT_RFC2231 != param_type)
>>>      {
>>> +      if (RFC2231_ENCODING == param_type && is_url_encoded) {
>>> +        *is_url_encoded = true;
>>> +      }
>>>        modify_param_value(value, param_type);
>>>      }
>>>    return true;
>>> @@ -1137,13 +1153,17 @@ extract_param (const char **source, para
>>>  /* Appends the string represented by VALUE to FILENAME */
>>>
>>>  static void
>>> -append_value_to_filename (char **filename, param_token const * const
>>> value)
>>> +append_value_to_filename (char **filename, param_token const * const
>>> value, bool is_url_encoded)
>>>  {
>>>    int original_length = strlen(*filename);
>>>    int new_length = strlen(*filename) + (value->e - value->b);
>>>    *filename = xrealloc (*filename, new_length+1);
>>>    memcpy (*filename + original_length, value->b, (value->e -
>>> value->b));
>>>    (*filename)[new_length] = '\0';
>>> +  if (is_url_encoded)
>>> +    {
>>> +      url_unescape(*filename + original_length);
>>> +    }
>>>  }
>>>
>>>  #undef MAX
>>> @@ -1176,7 +1196,8 @@ parse_content_disposition (const char *h
>>>  {
>>>    param_token name, value;
>>>    *filename = NULL;
>>> -  while (extract_param (&hdr, &name, &value, ';'))
>>> +  bool is_url_encoded;
>>> +  for (is_url_encoded = false; extract_param_new (&hdr, &name, &value,
>>> ';', &is_url_encoded); is_url_encoded = false)
>>>      {
>>>        int isFilename = BOUNDED_EQUAL_NO_CASE ( name.b, name.e,
>>> "filename" );
>>>        if ( isFilename && value.b != NULL)
>>> @@ -1192,9 +1213,15 @@ parse_content_disposition (const char *h
>>>              continue;
>>>
>>>            if (*filename)
>>> -            append_value_to_filename (filename, &value);
>>> +            append_value_to_filename (filename, &value, is_url_encoded);
>>>            else
>>> -            *filename = strdupdelim (value.b, value.e);
>>> +            {
>>> +              *filename = strdupdelim (value.b, value.e);
>>> +              if (is_url_encoded)
>>> +                {
>>> +                  url_unescape (*filename);
>>> +                }
>>> +            }
>>>          }
>>>      }
>>>
>>> diff -rupN wget-1.15/src/http.h wget-1.15-custom/src/http.h
>>> --- wget-1.15/src/http.h        2014-01-04 13:49:47.000000000 +0100
>>> +++ wget-1.15-custom/src/http.h 2014-02-12 08:27:28.725919927 +0100
>>> @@ -44,6 +44,7 @@ typedef struct {
>>>    const char *b, *e;
>>>  } param_token;
>>>  bool extract_param (const char **, param_token *, param_token *, char);
>>> +bool extract_param_new (const char **, param_token *, param_token *,
>>> char, bool *);
>>>
>>>
>>>  #endif /* HTTP_H */
>>> diff -rupN wget-1.15/src/url.c wget-1.15-custom/src/url.c
>>> --- wget-1.15/src/url.c 2014-01-04 13:49:47.000000000 +0100
>>> +++ wget-1.15-custom/src/url.c  2014-02-12 08:27:28.725919927 +0100
>>> @@ -169,7 +169,7 @@ static const unsigned char urlchr_table[
>>>     The transformation is done in place.  If you need the original
>>>     string intact, make a copy before calling this function.  */
>>>
>>> -static void
>>> +void
>>>  url_unescape (char *s)
>>>  {
>>>    char *t = s;                  /* t - tortoise */
>>> diff -rupN wget-1.15/src/url.h wget-1.15-custom/src/url.h
>>> --- wget-1.15/src/url.h 2013-10-21 16:50:12.000000000 +0200
>>> +++ wget-1.15-custom/src/url.h  2014-02-12 08:27:28.725919927 +0100
>>> @@ -100,6 +100,7 @@ struct url
>>>  /* Function declarations */
>>>
>>>  char *url_escape (const char *);
>>> +void url_unescape (char *);
>>>  char *url_escape_unsafe_and_reserved (const char *);
>>>
>>>  struct url *url_parse (const char *, int *, struct iri *iri, bool
>>> percent_encode);
>>>
>>>
>>>
>>> --
>>> Thanking You,
>>> Darshit Shah
>>>
>>>
>>
>
>
> --
> Thanking You,
> Darshit Shah
>
>

From 0a8066e0ce3a06db7d25f98e54450609c19c8b23 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Vladim=C3=ADr=20P=C3=BDcha?= <[email protected]>
Date: Fri, 14 Feb 2014 15:44:51 +0100
Subject: [PATCH] URL-decode the filename parameter of Content-Disposition
 HTTP header if it is encoded

---
 src/ChangeLog |   15 +++++++++++++++
 src/cookies.c |    6 +++---
 src/http.c    |   33 ++++++++++++++++++++++++++-------
 src/http.h    |    2 +-
 src/url.c     |    2 +-
 src/url.h     |    1 +
 6 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index b7b6753..1b9dc14 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,18 @@
+2014-02-14  Vladimír Pýcha  <[email protected]>
+
+	* http.c (parse_content_disposition, extract_param)
+	(append_value_to_filename, digest_authentication_encode): URL-decode the
+	filename parameter of Content-Disposition HTTP header if it is encoded. This
+	is related to --content-disposition.
+	New parameter of extract_param(), "is_url_encoded".
+	Add argument NULL to the call of extract_param() in
+	digest_authentication_encode().
+	* http.h: Add the new parameter to the declaration of extract_param().
+	* cookies.c (parse_set_cookie, test_cookies): Add argument NULL to the calls
+	of extract_param().
+	* url.c (url_unescape): Remove "static" modifier.
+	* url.h: Add declaration of url_unescape().
+
 2014-02-06  Giuseppe Scrivano  <[email protected]>
 
 	* main.c (print_version): Move copyright year out of the localized
diff --git a/src/cookies.c b/src/cookies.c
index 4efda88..6ba7b5a 100644
--- a/src/cookies.c
+++ b/src/cookies.c
@@ -346,7 +346,7 @@ parse_set_cookie (const char *set_cookie, bool silent)
   struct cookie *cookie = cookie_new ();
   param_token name, value;
 
-  if (!extract_param (&ptr, &name, &value, ';'))
+  if (!extract_param (&ptr, &name, &value, ';', NULL))
     goto error;
   if (!value.b)
     goto error;
@@ -360,7 +360,7 @@ parse_set_cookie (const char *set_cookie, bool silent)
   cookie->attr = strdupdelim (name.b, name.e);
   cookie->value = strdupdelim (value.b, value.e);
 
-  while (extract_param (&ptr, &name, &value, ';'))
+  while (extract_param (&ptr, &name, &value, ';', NULL))
     {
       if (TOKEN_IS (name, "domain"))
         {
@@ -1377,7 +1377,7 @@ test_cookies (void)
         param_token name, value;
         const char *ptr = data;
         int j = 0;
-        while (extract_param (&ptr, &name, &value, ';'))
+        while (extract_param (&ptr, &name, &value, ';', NULL))
           {
             char *n = strdupdelim (name.b, name.e);
             char *v = strdupdelim (value.b, value.e);
diff --git a/src/http.c b/src/http.c
index 5715df6..690fcde 100644
--- a/src/http.c
+++ b/src/http.c
@@ -1060,13 +1060,21 @@ modify_param_value (param_token *value, int encoding_type )
    filename=\"foo bar\"", the first call to this function will return
    the token named "attachment" and no value, and the second call will
    return the token named "filename" and value "foo bar".  The third
-   call will return false, indicating no more valid tokens.  */
+   call will return false, indicating no more valid tokens.
+
+   is_url_encoded is an out parameter. If not NULL, a boolean value will be
+   stored into it, letting the caller know whether or not the extracted value is
+   URL-encoded. The caller can then decode it with url_unescape(), which however
+   performs decoding in-place. URL-encoding is used by RFC 2231 to support
+   non-US-ASCII characters in HTTP header values.  */
 
 bool
 extract_param (const char **source, param_token *name, param_token *value,
-               char separator)
+               char separator, bool *is_url_encoded)
 {
   const char *p = *source;
+  if (is_url_encoded)
+    *is_url_encoded = false;   /* initializing the out parameter */
 
   while (c_isspace (*p)) ++p;
   if (!*p)
@@ -1125,6 +1133,8 @@ extract_param (const char **source, param_token *name, param_token *value,
   int param_type = modify_param_name(name);
   if (NOT_RFC2231 != param_type)
     {
+      if (RFC2231_ENCODING == param_type && is_url_encoded)
+        *is_url_encoded = true;
       modify_param_value(value, param_type);
     }
   return true;
@@ -1137,13 +1147,16 @@ extract_param (const char **source, param_token *name, param_token *value,
 /* Appends the string represented by VALUE to FILENAME */
 
 static void
-append_value_to_filename (char **filename, param_token const * const value)
+append_value_to_filename (char **filename, param_token const * const value,
+                          bool is_url_encoded)
 {
   int original_length = strlen(*filename);
   int new_length = strlen(*filename) + (value->e - value->b);
   *filename = xrealloc (*filename, new_length+1);
   memcpy (*filename + original_length, value->b, (value->e - value->b)); 
   (*filename)[new_length] = '\0';
+  if (is_url_encoded)
+    url_unescape (*filename + original_length);
 }
 
 #undef MAX
@@ -1176,7 +1189,9 @@ parse_content_disposition (const char *hdr, char **filename)
 {
   param_token name, value;
   *filename = NULL;
-  while (extract_param (&hdr, &name, &value, ';'))
+  bool is_url_encoded = false;
+  for ( ; extract_param (&hdr, &name, &value, ';', &is_url_encoded);
+        is_url_encoded = false)
     {
       int isFilename = BOUNDED_EQUAL_NO_CASE ( name.b, name.e, "filename" );
       if ( isFilename && value.b != NULL)
@@ -1192,9 +1207,13 @@ parse_content_disposition (const char *hdr, char **filename)
             continue;
 
           if (*filename)
-            append_value_to_filename (filename, &value);
+            append_value_to_filename (filename, &value, is_url_encoded);
           else
-            *filename = strdupdelim (value.b, value.e);
+            {
+              *filename = strdupdelim (value.b, value.e);
+              if (is_url_encoded)
+                url_unescape (*filename);
+            }
         }
     }
 
@@ -3730,7 +3749,7 @@ digest_authentication_encode (const char *au, const char *user,
   realm = opaque = nonce = algorithm = qop = NULL;
 
   au += 6;                      /* skip over `Digest' */
-  while (extract_param (&au, &name, &value, ','))
+  while (extract_param (&au, &name, &value, ',', NULL))
     {
       size_t i;
       size_t namelen = name.e - name.b;
diff --git a/src/http.h b/src/http.h
index 389456b..21f1ed5 100644
--- a/src/http.h
+++ b/src/http.h
@@ -43,7 +43,7 @@ typedef struct {
   /* A token consists of characters in the [b, e) range. */
   const char *b, *e;
 } param_token;
-bool extract_param (const char **, param_token *, param_token *, char);
+bool extract_param (const char **, param_token *, param_token *, char, bool *);
 
 
 #endif /* HTTP_H */
diff --git a/src/url.c b/src/url.c
index f554432..f32c726 100644
--- a/src/url.c
+++ b/src/url.c
@@ -169,7 +169,7 @@ static const unsigned char urlchr_table[256] =
    The transformation is done in place.  If you need the original
    string intact, make a copy before calling this function.  */
 
-static void
+void
 url_unescape (char *s)
 {
   char *t = s;                  /* t - tortoise */
diff --git a/src/url.h b/src/url.h
index cd3782b..6d18ed8 100644
--- a/src/url.h
+++ b/src/url.h
@@ -101,6 +101,7 @@ struct url
 
 char *url_escape (const char *);
 char *url_escape_unsafe_and_reserved (const char *);
+void url_unescape (char *);
 
 struct url *url_parse (const char *, int *, struct iri *iri, bool percent_encode);
 char *url_error (const char *, int);
-- 
1.7.9.5

Re: [Bug-wget] Submitting a patch fixing the --content-disposition option of wget

Reply via email to