Hi,
> It appears wget may be creating slightly malformed GZIP skip-length
> fields
I think that's correct: Wget doesn't write the subfield length in the
"extra field" section of the header. After the subfield ID "sl" it
should write the length LEN (see RFC 1952 [1]), but it doesn't.
Luckily, it does write the correct length of all extra fields (XLEN in
the RFC 1952), so Gzip implementations that just ignore the extra field
can skip it without problems. This is the case for the GNU Gzip utility.
But it should be fixed. I've attached a patch.
> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.
When I wrote the code for the extra field I used the old Hanzo
warc-tools [2] as an example. That implementation has the same problem:
it doesn't write the field length [3]. This means there's at least one
other tool that writes these off-spec warc.gz files, so it's probably
useful to make the parser a bit more robust.
Thanks,
Gijs
[1] http://www.gzip.org/zlib/rfc-gzip.html
[2] https://code.google.com/p/warc-tools/
[2]
https://code.google.com/p/warc-tools/source/browse/trunk/lib/private/wgzip.c#314
diff --git a/src/ChangeLog b/src/ChangeLog
index 8e1213f..65d636d 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,8 @@
+2013-03-31 Gijs van Tulder <[email protected]>
+
+ * warc.c: Correctly write the field length in the skip length field
+ of .warc.gz files. (Following the GZIP spec in RFC 1952.)
+
2013-03-12 Darshit Shah <[email protected]>
* http.c (gethttp): Make wget return FILEBADFILE error and abort if
diff --git a/src/warc.c b/src/warc.c
index fb506a7..9b10610 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -165,7 +165,7 @@ warc_write_string (const char *str)
}
-#define EXTRA_GZIP_HEADER_SIZE 12
+#define EXTRA_GZIP_HEADER_SIZE 14
#define GZIP_STATIC_HEADER_SIZE 10
#define FLG_FEXTRA 0x04
#define OFF_FLG 3
@@ -200,7 +200,7 @@ warc_write_start_record (void)
In warc_write_end_record we will fill this space
with information about the uncompressed and
compressed size of the record. */
- fprintf (warc_current_file, "XXXXXXXXXXXX");
+ fseek (warc_current_file, EXTRA_GZIP_HEADER_SIZE, SEEK_CUR);
fflush (warc_current_file);
/* Start a new GZIP stream. */
@@ -342,16 +342,19 @@ warc_write_end_record (void)
/* The extra header field identifier for the WARC skip length. */
extra_header[2] = 's';
extra_header[3] = 'l';
+ /* The size of the field value (8 bytes). */
+ extra_header[4] = (8 & 255);
+ extra_header[5] = ((8 >> 8) & 255);
/* The size of the uncompressed record. */
- extra_header[4] = (uncompressed_size & 255);
- extra_header[5] = (uncompressed_size >> 8) & 255;
- extra_header[6] = (uncompressed_size >> 16) & 255;
- extra_header[7] = (uncompressed_size >> 24) & 255;
+ extra_header[6] = (uncompressed_size & 255);
+ extra_header[7] = (uncompressed_size >> 8) & 255;
+ extra_header[8] = (uncompressed_size >> 16) & 255;
+ extra_header[9] = (uncompressed_size >> 24) & 255;
/* The size of the compressed record. */
- extra_header[8] = (compressed_size & 255);
- extra_header[9] = (compressed_size >> 8) & 255;
- extra_header[10] = (compressed_size >> 16) & 255;
- extra_header[11] = (compressed_size >> 24) & 255;
+ extra_header[10] = (compressed_size & 255);
+ extra_header[11] = (compressed_size >> 8) & 255;
+ extra_header[12] = (compressed_size >> 16) & 255;
+ extra_header[13] = (compressed_size >> 24) & 255;
/* Write the extra header after the static header. */
fseeko (warc_current_file, warc_current_gzfile_offset