Re: [Bug-wget] WARC, new version

Gijs van Tulder Sun, 30 Oct 2011 14:33:27 -0700

Hi David,

David H. Lipman wrote:

I have seen WARC mentioned but have not seen a definition.

WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web resources. It is used for making archives of web sites. The Internet Archive, for example, uses it as the file format for their Wayback Machine and Heritrix crawler.

The nice thing about WARC is that it lets you store all information about your web crawl: the files you download, of course, but also things like the HTTP request and response headers, information about redirects and error pages. WARC also provides a place to keep the related metadata. It is, in short, a way to store everything, in a standardized file format.


Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named 'gnu.warc.gz' that contains every HTTP request that wget made and every HTTP response it received. This is an 'archival grade' copy of the mirrored site.

Once you have the WARC file, you could store it in your archive, extract files from it, or run your own local Wayback Machine [2, 3].
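
To give an idea of what looking inside such a file could involve, here is a rough Python sketch that lists the record type and target URI of each record. It is not part of wget or of any WARC tool: the function names iter_warc_records and summarize are made up for illustration, and it assumes a well-formed gzip-compressed WARC file with no folded (multi-line) header fields and header names spelled exactly as in the standard.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream.

    Hypothetical helper: assumes well-formed records without folded header lines.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read "Name: value" header lines up to the empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The record block is exactly Content-Length bytes long.
        payload = stream.read(int(headers.get("Content-Length", "0")))
        yield headers, payload


def summarize(path):
    """Return a list of (WARC-Type, WARC-Target-URI) pairs for a .warc.gz file."""
    # gzip.open reads through concatenated gzip members transparently.
    with gzip.open(path, "rb") as stream:
        return [
            (headers.get("WARC-Type"), headers.get("WARC-Target-URI"))
            for headers, _ in iter_warc_records(stream)
        ]
```

With the example above, summarize("gnu.warc.gz") would return one pair per record, so you can see the request and response records that belong to each URL. For real archival work a dedicated WARC library or the Wayback tools are the better choice.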

wget is already a very useful tool for making a quick copy of a website; adding WARC support helps to make that copy as complete as possible.


Maybe that answers some of your questions?

Regards,

Gijs


[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php
