I'd like to apply the following patch to the GNU Wget master branch. I'm looking for comments, especially from those who care about Wget's WARC implementation. Do you think this is okay to apply in terms of the broader WARC ecosystem? With this patch, Wget will continue to generate WARC 1.0 files, but with the angle brackets around the WARC-Target-URI value removed. Should we change the version to WARC 1.1 now? While Wget's implementation is not feature complete, I do believe that the files it generates are WARC 1.1 compliant.
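For anyone assessing reader compatibility, here is a minimal sketch of how a tolerant reader could accept both forms of the header value. The helper name `target_uri_strip_brackets` is hypothetical; it is not part of Wget or of any WARC library I know of, and a real parser would also need to handle surrounding whitespace and header folding.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from Wget or any WARC library.
   Copies a WARC-Target-URI header value into OUT, dropping one pair of
   enclosing angle brackets if both are present.  Returns 0 on success,
   or -1 if OUT is too small to hold the result.  */
static int
target_uri_strip_brackets (const char *value, char *out, size_t out_size)
{
  size_t len = strlen (value);

  /* Old-style value: "<http://example.com/>".  */
  if (len >= 2 && value[0] == '<' && value[len - 1] == '>')
    {
      value++;
      len -= 2;
    }

  if (len + 1 > out_size)
    return -1;

  memcpy (out, value, len);
  out[len] = '\0';
  return 0;
}
```

Called with either `<http://example.com/>` (what Wget writes today) or `http://example.com/` (what it would write after this patch), the helper stores the same bare URI in the output buffer.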
===

Wget has historically been one of the few implementations of the WARC
1.0 standard that actually printed the URI enclosed in angle brackets.
This was identified as an erratum and removed from the WARC 1.1
specification. However, since Wget hasn't updated its implementation,
it has continued to create old-style WARC files with the angle
brackets.

Let's remove this and start generating WARC files without the angle
brackets. This does mean that Wget is now no longer completely
compliant with either the WARC 1.0 or WARC 1.1 standards. But since
most WARC libraries support the reading of such files, it should not be
a problem.

* src/warc.c: Remove `warc_write_header_uri` and replace all usages
  with `warc_write_header`
---
 src/warc.c | 24 ++++--------------------
 1 file changed, 4 insertions(+), 20 deletions(-)

diff --git a/src/warc.c b/src/warc.c
index 230bd36f..bbc825f7 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -270,22 +270,6 @@ warc_write_header (const char *name, const char *value)
   return warc_write_ok;
 }
 
-/* Writes a WARC header with a URI as value to the current WARC record.
-   This method may be run after warc_write_start_record and
-   before warc_write_block_from_file. */
-static bool
-warc_write_header_uri (const char *name, const char *value)
-{
-  if (value)
-    {
-      warc_write_string (name);
-      warc_write_string (": <");
-      warc_write_string (value);
-      warc_write_string (">\r\n");
-    }
-  return warc_write_ok;
-}
-
 /* Copies the contents of DATA_IN to the WARC record.
    Adds a Content-Length header to the WARC record.
    Run this method after warc_write_header,
@@ -1339,7 +1323,7 @@ warc_write_request_record (const char *url, const char *timestamp_str,
 {
   warc_write_start_record ();
   warc_write_header ("WARC-Type", "request");
-  warc_write_header_uri ("WARC-Target-URI", url);
+  warc_write_header ("WARC-Target-URI", url);
   warc_write_header ("Content-Type", "application/http;msgtype=request");
   warc_write_date_header (timestamp_str);
   warc_write_header ("WARC-Record-ID", record_uuid);
@@ -1448,7 +1432,7 @@ warc_write_revisit_record (const char *url, const char *timestamp_str,
   warc_write_header ("WARC-Refers-To", refers_to);
   warc_write_header ("WARC-Profile", "http://netpreserve.org/warc/1.0/revisit/identical-payload-digest");
   warc_write_header ("WARC-Truncated", "length");
-  warc_write_header_uri ("WARC-Target-URI", url);
+  warc_write_header ("WARC-Target-URI", url);
   warc_write_date_header (timestamp_str);
   warc_write_ip_header (ip);
   warc_write_header ("Content-Type", "application/http;msgtype=response");
@@ -1540,7 +1524,7 @@ warc_write_response_record (const char *url, const char *timestamp_str,
   warc_write_header ("WARC-Record-ID", response_uuid);
   warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
   warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
-  warc_write_header_uri ("WARC-Target-URI", url);
+  warc_write_header ("WARC-Target-URI", url);
   warc_write_date_header (timestamp_str);
   warc_write_ip_header (ip);
   warc_write_header ("WARC-Block-Digest", block_digest);
@@ -1597,7 +1581,7 @@ warc_write_record (const char *record_type, const char *resource_uuid,
   warc_write_header ("WARC-Record-ID", resource_uuid);
   warc_write_header ("WARC-Warcinfo-ID", warc_current_warcinfo_uuid_str);
   warc_write_header ("WARC-Concurrent-To", concurrent_to_uuid);
-  warc_write_header_uri ("WARC-Target-URI", url);
+  warc_write_header ("WARC-Target-URI", url);
   warc_write_date_header (timestamp_str);
   warc_write_ip_header (ip);
   warc_write_digest_headers (body, payload_offset);
-- 
2.47.0