Hi,

Attached is a patch for warc.c. It passes redirect_location through
url_escape so that reserved characters are escaped. Up to the current
version (1.19), wget (with the warc and warc-cdx flags) writes the
redirect_location unescaped. If the value contains whitespace (e.g.
unescaped error messages or OAuth scope information), the resulting line
is nearly impossible to parse, because wget uses whitespace as the field
separator.
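
To illustrate the problem, here is a minimal, self-contained sketch. It
does not use wget's code: escape_field is a hypothetical stand-in for
url_escape (it is stricter and percent-encodes every byte that is not an
RFC 3986 unreserved character), and count_fields is a hypothetical helper
that counts space-separated fields the way a simple CDX parser might.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for url_escape (NOT wget's implementation):
   percent-encodes every byte that is not an RFC 3986 unreserved
   character. Returns a freshly allocated string the caller must free,
   or NULL if allocation fails. */
static char *escape_field(const char *s)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t len = strlen(s);
    char *out = malloc(3 * len + 1);   /* worst case: every byte -> %XX */
    char *p = out;

    if (out == NULL)
        return NULL;

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            *p++ = (char)c;
        } else {
            *p++ = '%';
            *p++ = hex[c >> 4];
            *p++ = hex[c & 0x0F];
        }
    }
    *p = '\0';
    return out;
}

/* Hypothetical helper: counts the space-separated fields in a line. */
static int count_fields(const char *line)
{
    int n = 0;
    int in_field = 0;

    for (; *line != '\0'; line++) {
        if (*line == ' ') {
            in_field = 0;
        } else if (!in_field) {
            in_field = 1;
            n++;
        }
    }
    return n;
}
```

With a location such as "http://example.com/cb?error=access denied", a
line built from the raw value has one field more than a parser expects,
while the escaped value contains no space and stays a single field.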

The sample CDX writer published by the Internet Archive (
https://github.com/internetarchive/CDX-Writer) also URL-encodes the
redirect_location.

Best Regards
Christof Horschitz
--- warc.c	2016-09-07 11:35:24.000000000 +0200
+++ warc_new.c	2017-03-22 08:32:28.395540715 +0100
@@ -32,6 +32,7 @@
 #include "utils.h"
 #include "version.h"
 #include "dirname.h"
+#include "url.h"
 
 #include <stdio.h>
 #include <stdlib.h>
@@ -1365,6 +1366,8 @@
     mime_type = "-";
   if (redirect_location == NULL || strlen(redirect_location) == 0)
     redirect_location = "-";
+  else
+    redirect_location = url_escape(redirect_location);
 
   number_to_string (offset_string, offset);
 
