Re: [Bug-wget] WARC, new version

David H. Lipman Sun, 30 Oct 2011 14:43:36 -0700

From: "Gijs van Tulder" <[email protected]>

> Hi David,
>
> David H. Lipman wrote:
>> I have seen WARC mentioned but have not seen a definition.
>
> WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
> resources. It is used for making archives of web sites. The Internet
> Archive, for example, uses it as the file format for their Wayback Machine
> and Heritrix crawler.
>
> The nice thing about WARC is that it lets you store all information about
> your web crawl: the files you download, of course, but also things like the
> HTTP request and response headers, information about redirects and error
> pages. WARC also provides a place to keep the related metadata. It is, in
> short, a way to store everything, in a standardized file format.
>
> Adding WARC to wget means that you'll be able to do things like
>
>    wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
>
> which will produce (next to the normal wget download) a file named
> 'gnu.warc.gz' that contains every HTTP request wget made and every HTTP
> response it received. This is an 'archival grade' copy of the mirrored site.
>
> Once you have the WARC file, you could store it in your archive, extract
> files from it, or run your own local Wayback Machine [2, 3].
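
To make "extract files" a little more concrete, here is a minimal sketch of listing the records in such a file. It uses only the Python standard library and assumes the usual WARC record layout: a "WARC/..." version line, named header fields, a blank line, and then a body of Content-Length bytes. The names iter_warc_headers and summarize are made up for this illustration, 'gnu.warc.gz' is the file name from the example above, and folded (multi-line) header values are not handled.

```python
import gzip
import io


def iter_warc_headers(stream):
    """Yield one dict of header fields per WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body; its size is given by Content-Length.
        stream.read(int(headers.get("Content-Length", 0)))
        yield headers


def summarize(path):
    """Print the record type and target URI of every record in a .warc.gz file."""
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(path, "rb") as f:
        for h in iter_warc_headers(f):
            print(h.get("WARC-Type"), h.get("WARC-Target-URI", "-"))
```

Calling summarize("gnu.warc.gz") should then print one line per record, for example the request and response records for each URL that was fetched.
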
>
> wget is already a very useful tool for making a quick copy of a website;
> adding WARC support helps to make the copy as complete as possible.
>
> Maybe that answers some of your questions?
>
> Regards,
>
> Gijs
>
>
> [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> [2] http://archive-access.sourceforge.net/projects/wayback/
> [3] http://netpreserve.org/software/downloads.php
>



It answers all my questions, and now I understand.

*Thank You Gijs !*

-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp
