Hi David,

David H. Lipman wrote:
> I have seen WARC mentioned but have not seen a definition.

WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web resources. It is used for making archives of web sites. The Internet Archive, for example, uses it as the file format for their Wayback Machine and Heritrix crawler.

The nice thing about WARC is that it lets you store all information about your web crawl: the files you download, of course, but also things like the HTTP request and response headers, information about redirects and error pages. WARC also provides a place to keep the related metadata. It is, in short, a way to store everything, in a standardized file format.

Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named 'gnu.warc.gz' that contains every HTTP request and every HTTP response that wget made. This is an 'archival grade' copy of the mirrored site.

Once you have the WARC file, you could store it in your archive, extract files from it, or run your own local Wayback Machine [2, 3].
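
To give a rough idea of what "extract files" could look like, here is a minimal Python sketch that walks the records in a WARC file using only the standard library. It assumes the simple layout described in the standard: a "WARC/x.y" version line, "Name: value" header lines, a blank line, then Content-Length bytes of content. The function names (iter_warc_records, list_responses) are made up for this example, header names are matched exactly as written, and folded header lines are not handled, so treat it as an illustration rather than a complete reader.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read "Name: value" header lines up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record content is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def list_responses(path):
    """Print the URL and content size of every 'response' record in a .warc.gz file."""
    # gzip.open reads through concatenated gzip members transparently.
    with gzip.open(path, "rb") as f:
        for headers, body in iter_warc_records(f):
            if headers.get("WARC-Type") == "response":
                print(headers.get("WARC-Target-URI"), len(body))
```

With the example above, calling list_responses("gnu.warc.gz") should print one line per HTTP response that was archived. The body of a response record still contains the HTTP status line and headers, so extracting the actual file means splitting those off first.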

wget is already a very useful tool for making a quick copy of a website. Adding WARC support helps to make that copy as complete as possible.

Maybe that answers some of your questions?

Regards,

Gijs


[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php
