From: "Gijs van Tulder" <[email protected]>

> Hi David,
>
> David H. Lipman wrote:
>> I have seen WARC mentioned but have not seen a definition.
>
> WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
> resources. It is used for making archives of web sites. The Internet
> Archive, for example, uses it as the file format for their Wayback Machine
> and Heritrix crawler.
>
> The nice thing about WARC is that it lets you store all information about
> your web crawl: the files you download, of course, but also things like the
> HTTP request and response headers and information about redirects and error
> pages. WARC also provides a place to keep the related metadata. It is, in
> short, a way to store everything in a standardized file format.
>
> Adding WARC to wget means that you'll be able to do things like
>
>   wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
>
> which will produce (next to the normal wget download) a file named
> 'gnu.warc.gz' that contains every HTTP request that wget made and every
> HTTP response it received. This is an 'archival grade' copy of the
> mirrored site.
>
> Once you have the WARC file, you could store it in your archive, extract
> files, or run your own local Wayback Machine [2, 3].
>
> wget is already a very useful tool to make a quick copy of a website;
> adding WARC support helps to make the copy as complete as possible.
>
> Maybe that answers some of your questions?
>
> Regards,
>
> Gijs
>
> [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> [2] http://archive-access.sourceforge.net/projects/wayback/
> [3] http://netpreserve.org/software/downloads.php

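For anyone who wants to peek inside such a file, here is a minimal Python sketch. It assumes a gzip-compressed WARC like the 'gnu.warc.gz' from the wget example in the quoted message, uses only the standard library, and counts records by their WARC-Type header. The helper names iter_warc_headers and summarize are made up for illustration, and the parser ignores folded (continuation) header lines, so treat it as a rough illustration and not a complete WARC reader.

```python
import gzip
from collections import Counter


def iter_warc_headers(stream):
    """Yield the header fields of each WARC record as a dict.

    `stream` is a binary file object positioned at the start of a record.
    Hypothetical helper for illustration; folded header lines are not handled.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # Skip the record body; its size in bytes is given by Content-Length.
        stream.read(int(headers.get("Content-Length", "0")))
        yield headers


def summarize(path):
    """Count the records in a gzip-compressed WARC file by WARC-Type."""
    with gzip.open(path, "rb") as stream:
        return Counter(h.get("WARC-Type", "?") for h in iter_warc_headers(stream))
```

Calling summarize("gnu.warc.gz") should then return a Counter keyed by record type, for example 'request' and 'response', which is a quick way to check that both sides of each HTTP exchange were stored.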
It answers all the questions and now I understand. *Thank you, Gijs!*

-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp
