Hi David,
David H. Lipman wrote:
> I have seen WARC mentioned but have not seen a definition.
WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
resources. It is used for making archives of web sites. The Internet
Archive, for example, uses it as the file format for their Wayback
Machine and Heritrix crawler.
The nice thing about WARC is that it lets you store all information
about your web crawl: the files you download, of course, but also things
like the HTTP request and response headers, information about redirects
and error pages. WARC also provides a place to keep the related
metadata. It is, in short, a way to store everything, in a standardized
file format.
Adding WARC to wget means that you'll be able to do things like:
    wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
which will produce (next to the normal wget download) a file named
'gnu.warc.gz' that contains every HTTP request and every HTTP response
that wget made. This is an 'archival grade' copy of the mirrored site.
Once you have the WARC file, you can store it in your archive, extract
files from it, or run your own local Wayback Machine [2, 3].
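To give a rough idea of what is inside such a file, here is a minimal
Python sketch that lists the records in a WARC file. It is not part of
wget or of any WARC tool; the names iter_warc_records and summarize are
made up for this example. It assumes the file is a gzip stream of
records, each consisting of a "WARC/..." version line, named header
fields, a blank line, and a body of Content-Length bytes, and it does
not handle folded (multi-line) header values.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body length is given by the record's own Content-Length field.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def summarize(path):
    """Print the type, target URI and body size of every record in a .warc.gz file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as stream:
        for headers, body in iter_warc_records(stream):
            print(headers.get("WARC-Type"), headers.get("WARC-Target-URI"), len(body))
```

Calling summarize("gnu.warc.gz") on the file from the example above
should print one line per stored record, so you can see the requests
and responses side by side.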
wget is already a very useful tool for making a quick copy of a website;
adding WARC support helps make that copy as complete as possible.
Maybe that answers some of your questions?
Regards,
Gijs
[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php