Re: [Bug-wget] WARC, new version
Hey Gijs,

I have added a ChangeLog entry and pushed the change. Thanks!

Giuseppe

Gijs van Tulder gvtul...@gmail.com writes:

>> lovely. I am going to push it soon with some small adjustments.
>
> That's good to hear. There's one other small adjustment that you may
> want to make, see the attached patch. One of the WARC functions uses
> the basename function, which causes problems on OS X. Including
> libgen.h and strdup-ing the output of basename seems to solve this
> problem.
>
> Thanks,
> Gijs
>
> On 04-11-11 22:27, Giuseppe Scrivano wrote:
>> Gijs van Tulder gvtul...@gmail.com writes:
>>
>>> Hi Giuseppe,
>>>
>>> * I've changed the configure.ac and src/Makefile.am.
>>> * I've added a ChangeLog entry.
>>
>> lovely. I am going to push it soon with some small adjustments.
>>
>> Thanks for the great work. Whenever we happen to be in the same
>> place, I'll buy you a beer :-)
>>
>> Cheers,
>> Giuseppe
Re: [Bug-wget] WARC, new version
Gijs van Tulder gvtul...@gmail.com writes:

> Hi Giuseppe,
>
> * I've changed the configure.ac and src/Makefile.am.
> * I've added a ChangeLog entry.

lovely. I am going to push it soon with some small adjustments.

Thanks for the great work. Whenever we happen to be in the same
place, I'll buy you a beer :-)

Cheers,
Giuseppe
Re: [Bug-wget] WARC, new version
> lovely. I am going to push it soon with some small adjustments.

That's good to hear. There's one other small adjustment that you may
want to make, see the attached patch. One of the WARC functions uses
the basename function, which causes problems on OS X. Including
libgen.h and strdup-ing the output of basename seems to solve this
problem.

Thanks,
Gijs

On 04-11-11 22:27, Giuseppe Scrivano wrote:
> Gijs van Tulder gvtul...@gmail.com writes:
>
>> Hi Giuseppe,
>>
>> * I've changed the configure.ac and src/Makefile.am.
>> * I've added a ChangeLog entry.
>
> lovely. I am going to push it soon with some small adjustments.
>
> Thanks for the great work. Whenever we happen to be in the same
> place, I'll buy you a beer :-)
>
> Cheers,
> Giuseppe

--- a/src/warc.c 2011-11-04 17:41:11.383704054 +0100
+++ b/src/warc.c 2011-11-04 23:06:28.693712714 +0100
@@ -19,6 +19,10 @@
 #include <uuid/uuid.h>
 #endif
 
+#ifndef WINDOWS
+#include <libgen.h>
+#endif
+
 #include "warc.h"
 
 extern char *version_string;
@@ -605,7 +609,7 @@
   char *filename_copy, *filename_basename;
   filename_copy = strdup (filename);
-  filename_basename = basename (filename_copy);
+  filename_basename = strdup (basename (filename_copy));
 
   warc_write_start_record ();
   warc_write_header ("WARC-Type", "warcinfo");
@@ -619,6 +623,7 @@
   if (warc_tmp == NULL)
     {
       free (filename_copy);
+      free (filename_basename);
       return false;
     }
 
@@ -646,6 +651,7 @@
     }
 
   free (filename_copy);
+  free (filename_basename);
   fclose (warc_tmp);
   return warc_write_ok;
 }
Re: [Bug-wget] WARC, new version
Gijs van Tulder gvtul...@gmail.com writes:

> === modified file 'bootstrap.conf'
> --- bootstrap.conf 2011-08-11 12:23:39 +0000
> +++ bootstrap.conf 2011-10-21 19:24:18 +0000
> @@ -28,6 +28,7 @@
>  accept
>  alloca
>  announce-gen
> +base32
>  bind
>  c-ctype
>  clock-time
> @@ -49,6 +50,7 @@
>  mbtowc
>  mkdir
>  crypto/md5
> +crypto/sha1
>  pipe
>  quote
>  quotearg
> @@ -63,6 +65,7 @@
>  stdbool
>  strcasestr
>  strerror_r-posix
> +tmpdir
>  unlocked-io
>  update-copyright
>  vasprintf
>
> === modified file 'configure.ac'
> --- configure.ac 2011-09-04 12:19:12 +0000
> +++ configure.ac 2011-10-23 21:21:49 +0000
> @@ -511,7 +511,22 @@
>    fi
>  fi
>
> -
> +# Warc
> +AC_CHECK_HEADER(uuid/uuid.h, UUID_FOUND=yes, UUID_FOUND=no)
> +if test x$UUID_FOUND = xno; then
> +  AC_MSG_ERROR([libuuid is required])
> +fi
> +
> +AC_CHECK_LIB(uuid, uuid_generate, UUID_FOUND=yes, UUID_FOUND=no)
> +if test x$UUID_FOUND = xno; then
> +  AC_MSG_ERROR([libuuid is required])
> +fi
> +LIBUUID=-luuid
> +AC_SUBST(LIBUUID)
> +LDFLAGS="${LDFLAGS} -L$libuuid/lib"
> +CPPFLAGS="${CPPFLAGS} -I$libuuid/include"

I think we shouldn't change the value of LDFLAGS and CPPFLAGS as they
are user variables. Also, where is $libuuid defined? We can just drop
these lines.

>        if (hs->res >= 0)
>          CLOSE_FINISH (sock);
>        else
> -        {
> -          if (hs->res < 0)
> -            hs->rderrmsg = xstrdup (fd_errstr (sock));
> -          CLOSE_INVALIDATE (sock);
> -        }
> +        CLOSE_INVALIDATE (sock);

Why?

The rest seems ok, if you also provide a ChangeLog I can proceed to
merge it.

Thanks,
Giuseppe
Re: [Bug-wget] WARC, new version
From: Giuseppe Scrivano gscriv...@gnu.org

I have seen WARC mentioned but have not seen a definition.

What is WARC? What is WARC used for? Windows or 'nix? What are its
benefits, etc.?

-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp
Re: [Bug-wget] WARC, new version
Hi David,

David H. Lipman wrote:
> I have seen WARC mentioned but have not seen a definition.

WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing
web resources. It is used for making archives of web sites. The
Internet Archive, for example, uses it as the file format for their
Wayback Machine and Heritrix crawler.

The nice thing about WARC is that it lets you store all information
about your web crawl: the files you download, of course, but also
things like the HTTP request and response headers, and information
about redirects and error pages. WARC also provides a place to keep
the related metadata. It is, in short, a way to store everything in a
standardized file format.

Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named
'gnu.warc.gz' that contains every HTTP request and every HTTP
response that wget made. This is an 'archival grade' copy of the
mirrored site. Once you have the WARC file, you could store it in
your archive, extract files, or run your own local Wayback Machine
[2, 3].

wget is already a very useful tool for making a quick copy of a
website; adding WARC support helps to make the copy as complete as
possible.

Maybe that answers some of your questions?

Regards,
Gijs

[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php
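To make "a way to store everything" a little more concrete, here is a rough sketch of what a single uncompressed WARC/1.0 record looks like on disk: a version line, named header fields, a blank line, the content block, and two trailing CRLFs. This is not the code from the wget patch. format_warc_record is a hypothetical helper, and it assumes the payload is a NUL-terminated text string, which real HTTP payloads are not.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from wget: format one uncompressed
   WARC/1.0 record into BUF.  PAYLOAD is assumed to be NUL-terminated
   text; real records carry arbitrary bytes.  Returns the number of
   bytes written (excluding the terminating NUL), or -1 if BUF is too
   small.  */
int
format_warc_record (char *buf, size_t bufsize, const char *type,
                    const char *record_id, const char *date,
                    const char *payload)
{
  int n = snprintf (buf, bufsize,
                    "WARC/1.0\r\n"
                    "WARC-Type: %s\r\n"
                    "WARC-Record-ID: <%s>\r\n"
                    "WARC-Date: %s\r\n"
                    "Content-Length: %lu\r\n"
                    "\r\n"          /* end of the record header */
                    "%s"            /* content block */
                    "\r\n\r\n",     /* record terminator */
                    type, record_id, date,
                    (unsigned long) strlen (payload), payload);

  if (n < 0 || (size_t) n >= bufsize)
    return -1;
  return n;
}
```

A 'warcinfo' record describing the crawl, followed by 'request' and 'response' records for each HTTP exchange, would each take this shape. In a .warc.gz file every record is typically compressed as its own gzip member.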
Re: [Bug-wget] WARC, new version
From: Gijs van Tulder gvtul...@gmail.com

> Hi David,
>
> David H. Lipman wrote:
>> I have seen WARC mentioned but have not seen a definition.
>
> WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing
> web resources. It is used for making archives of web sites. The
> Internet Archive, for example, uses it as the file format for their
> Wayback Machine and Heritrix crawler.
>
> The nice thing about WARC is that it lets you store all information
> about your web crawl: the files you download, of course, but also
> things like the HTTP request and response headers, and information
> about redirects and error pages. WARC also provides a place to keep
> the related metadata. It is, in short, a way to store everything in
> a standardized file format.
>
> Adding WARC to wget means that you'll be able to do things like
>
>   wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu
>
> which will produce (next to the normal wget download) a file named
> 'gnu.warc.gz' that contains every HTTP request and every HTTP
> response that wget made. This is an 'archival grade' copy of the
> mirrored site. Once you have the WARC file, you could store it in
> your archive, extract files, or run your own local Wayback Machine
> [2, 3].
>
> wget is already a very useful tool for making a quick copy of a
> website; adding WARC support helps to make the copy as complete as
> possible.
>
> Maybe that answers some of your questions?
>
> Regards,
> Gijs
>
> [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
> [2] http://archive-access.sourceforge.net/projects/wayback/
> [3] http://netpreserve.org/software/downloads.php

It answers all the questions, and now I understand.

*Thank You Gijs!*

-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp
Re: [Bug-wget] WARC, new version
Gijs van Tulder gvtul...@gmail.com writes:

> Hi all,
>
> Based on the comments by Giuseppe and Ángel I've revised the
> implementation of the wget WARC extension. I've attached a patch.
>
> 1. It's no longer based on the warctools library. Instead, I've
>    written a couple of new WARC-writing functions, using zlib for
>    the gzip compression. The new implementation is much smaller.
>
> 2. I extracted a small part of the gethttp method in http.c and
>    moved it to a new function, read_response_body, which is
>    responsible for downloading the response body and writing it to
>    a file. The WARC extension needs to save the response in multiple
>    cases: when the response is successful, but also when the
>    response is a redirect, 401 unauthorized or an error. Moving the
>    response-saving to a separate method makes it possible to reuse
>    this part for all four situations.
>
> Any thoughts?

WOW great work! It is much better now.

I wonder if it is possible to remove the dependency on libuuid, maybe
by providing a replacement for uuid_generate and uuid_unparse when
libuuid is not found? Even a simple implementation based on rand?

Besides that, there are only very small adjustments which need to be
made to the code in order to include it in wget, like keeping lines
no longer than 80 characters or using foo *bar instead of foo * bar;
in any case these are not important and I can go through them before
committing your changes.

Thanks,
Giuseppe
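The rand-based fallback suggested above could look roughly like the following. This is a sketch under stated assumptions, not what was eventually committed to wget: fallback_uuid_string is a hypothetical name, and it combines the roles of uuid_generate and uuid_unparse by producing the 36-character text form directly. rand() is not a strong random source, so collisions are more likely than with libuuid, and the caller is assumed to have seeded it with srand().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fallback, not taken from the patch: fill OUT with a
   random, version-4 style UUID string (36 characters plus NUL) built
   from rand().  The caller is expected to have called srand().  */
void
fallback_uuid_string (char out[37])
{
  unsigned char b[16];
  int i;

  /* Skip the lowest bits of rand(), which are weak on some systems.  */
  for (i = 0; i < 16; i++)
    b[i] = (unsigned char) ((rand () >> 7) & 0xff);

  b[6] = (unsigned char) ((b[6] & 0x0f) | 0x40);   /* version 4 */
  b[8] = (unsigned char) ((b[8] & 0x3f) | 0x80);   /* RFC 4122 variant */

  snprintf (out, 37,
            "%02x%02x%02x%02x-%02x%02x-%02x%02x-"
            "%02x%02x-%02x%02x%02x%02x%02x%02x",
            b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
            b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
}
```

A WARC-Record-ID could then be formed by prefixing the result with "urn:uuid:". Whether this is good enough depends on how much uniqueness the archive needs; configure could still prefer libuuid when it is found.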