On Friday, 29 March 2013, Andy Jackson wrote:
> When using wget 1.14 to generate warc.gz files, e.g.
>
> wget -O tempname --warc-file="output" "http://example.com"
>
> the files this creates do not play back well using the Internet Archive's
> warc.gz parsers, throwing errors like
>
> "Invalid FExtra length/records".
>
> It appears wget may be creating slightly malformed GZIP skip-length
> fields - see
>
> https://github.com/ukwa/warc-discovery/issues/1
>
> for details.
>
> It's likely that we'll need to make the warc.gz parsers a bit more
> robust, but I thought I'd mention it here in case this is
> actually a bug in wget.
>
> Thanks for your time.
>
> Andy Jackson

Just a very quick test (before I go to bed) shows behaviour I did not expect:

$ wget -O tempname --warc-file="output" "http://example.com"

results in a 5065-byte file 'output.warc.gz'. Unzipping it and zipping it again results in a 2387-byte file.

So, at first glance, it looks like Wget compresses very suboptimally. But I won't call it a bug before I take a deeper look (in the next few days).

Regards,
Tim
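
For anyone who wants to repeat the unzip/rezip comparison, here is a minimal Python sketch. It rests on an assumption that is not confirmed in this thread: that the .warc.gz file is a concatenation of several gzip members (for instance one per WARC record), which by itself would make the file larger than a single recompressed stream. The function names `split_gzip_members` and `compare_sizes` are made up for this illustration.

```python
import gzip
import zlib


def split_gzip_members(data: bytes) -> list[bytes]:
    """Return the decompressed payload of each gzip member found in data."""
    payloads = []
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        member = zlib.decompressobj(wbits=31)
        payloads.append(member.decompress(data) + member.flush())
        # Bytes after the end of this member belong to the next one.
        data = member.unused_data
    return payloads


def compare_sizes(data: bytes) -> tuple[int, int, int]:
    """Return (member count, original size, size after recompressing as one stream)."""
    payloads = split_gzip_members(data)
    recompressed = gzip.compress(b"".join(payloads))
    return len(payloads), len(data), len(recompressed)
```

Reading 'output.warc.gz' in binary mode and passing its bytes to `compare_sizes` gives the number of gzip members next to the two sizes. A member count above one would suggest that at least part of the size difference comes from per-member overhead rather than from a compression bug; it says nothing about whether the FExtra fields are well-formed.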
