[ 
https://issues.apache.org/jira/browse/NUTCH-2213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Rau updated NUTCH-2213:
-----------------------------
    Description: 
I have downloaded [a WARC 
file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
 from the Common Crawl data. This file contains several gzip-encoded responses 
whose bodies are stored as plain text (without the gzip encoding).

I used [warctools|https://github.com/internetarchive/warctools] from the 
Internet Archive to extract the responses from the WARC file. However, this 
tool expects the Content-Length field to match the actual length of the body 
in the WARC, which is apparently not the case for these responses ([see the 
issue on 
GitHub|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
 warctools uses a more up-to-date version of hanzo warctools, which is 
recommended on the [Common Crawl 
website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
file format".
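
The Content-Length expectation described above can be illustrated with a 
minimal Python sketch. It is not taken from warctools or Nutch: the function 
name check_http_payload and the sample payload are hypothetical, and it 
assumes the raw HTTP response (status line, headers, blank line, body) has 
already been extracted from a WARC response record.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def check_http_payload(raw: bytes) -> dict:
    """Compare the declared HTTP headers with the body that is actually stored.

    Hypothetical helper for illustration only; `raw` is assumed to be the
    HTTP response as stored in a WARC response record.
    """
    head, _, body = raw.partition(b"\r\n\r\n")
    headers = {}
    for line in head.split(b"\r\n")[1:]:  # skip the status line
        name, colon, value = line.partition(b":")
        if colon:
            key = name.strip().lower().decode("latin-1")
            headers[key] = value.strip().decode("latin-1")

    declared = headers.get("content-length")
    declared_len = int(declared) if declared and declared.isdigit() else None
    return {
        "declared_length": declared_len,
        "actual_length": len(body),
        # False when the header is missing or differs from the stored body
        "length_matches": declared_len == len(body),
        "claims_gzip": "gzip" in headers.get("content-encoding", "").lower(),
        "body_is_gzip": body.startswith(GZIP_MAGIC),
    }


# Made-up sample: headers describe the gzip-compressed body in both cases.
html = b"<html><body>hello</body></html>"
compressed = gzip.compress(html)
head = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Encoding: gzip\r\n"
    b"Content-Length: " + str(len(compressed)).encode("ascii") + b"\r\n\r\n"
)

consistent = check_http_payload(head + compressed)  # body stored as sent
inconsistent = check_http_payload(head + html)      # body stored decoded

print(consistent)
print(inconsistent)
```

In the second case the headers still describe the compressed body while the 
stored body is the decoded text, so a parser that reads exactly 
Content-Length bytes would stop at the wrong offset.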

I have not used Nutch myself and therefore cannot say which versions are 
affected by this.

After reading [the official WARC 
draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
could not find out how gzipped content is supposed to be stored. However, 
multiple WARC file parsers will probably have an issue with this.

It would be nice to know whether you consider this a bug and plan to fix it, 
and whether this is a major issue that affects most WARC files of the 
Common Crawl data or only a small part.

  was:
I have downloaded [a WARC 
file|https://aws-publicdatasets.s3.amazonaws.com/common-crawl/crawl-data/CC-MAIN-2015-40/segments/1443738099622.98/warc/CC-MAIN-20151001222139-00240-ip-10-137-6-227.ec2.internal.warc.gz]
 from the common crawl data. This file contains several gzipped responses which 
are stored plaintext (without the gzip encoding).
I used [warctools|https://github.com/internetarchive/warctools] from Internet 
Archive to extract the responses out of the WARC file. However this tool 
expects the Content-Length field to match the actual length of the body in the 
WARC ([See the issue on 
github|https://github.com/internetarchive/warctools/pull/14#issuecomment-182048962]).
 warctools uses a more up to date version of hanzo warctools which is 
recommended on the [Common Crawl 
website|https://commoncrawl.org/the-data/get-started/] under "Processing the 
file format".
I have not been using Nutch and can therefore not say which versions are 
affected by this.
After reading [the official WARC 
draft|http://archive-access.sourceforge.net/warc/warc_file_format-0.9.html] I 
could not find out how gzipped content is supposed to be stored. However 
probably multiple WARC file parsers will have an issue with this.
It would be nice to know whether you consider this a bug and plan on fixing 
this and whether this is a major issue which concerns most WARC files of the 
Common Crawl data or only a small part.


> CommonCrawlDataDumper saves gzipped body in extracted form
> ----------------------------------------------------------
>
>                 Key: NUTCH-2213
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2213
>             Project: Nutch
>          Issue Type: Bug
>          Components: commoncrawl, dumpers
>            Reporter: Joris Rau
>            Priority: Critical
>              Labels: easyfix



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
