Re: [Bug-wget] WARC, new version

2011-11-05 Thread Giuseppe Scrivano
Hey Gijs,

I have added a ChangeLog entry and pushed the change.

Thanks!
Giuseppe



Gijs van Tulder gvtul...@gmail.com writes:

 lovely.  I am going to push it soon with some small adjustments.

 That's good to hear.

 There's one other small adjustment that you may want to make, see the
 attached patch. One of the WARC functions uses the basename function,
 which causes problems on OS X. Including libgen.h and strdup-ing the
 output of basename seems to solve this problem.

 Thanks,

 Gijs


 On 04-11-11 22:27, Giuseppe Scrivano wrote:
 Gijs van Tulder gvtul...@gmail.com writes:

 Hi Giuseppe,

 * I've changed the configure.ac and src/Makefile.am.
 * I've added a ChangeLog entry.

 lovely.  I am going to push it soon with some small adjustments.

 Thanks for the great work.  Whenever we happen to be in the same place,
 I'll buy you a beer :-)

 Cheers,
 Giuseppe



Re: [Bug-wget] WARC, new version

2011-11-04 Thread Giuseppe Scrivano
Gijs van Tulder gvtul...@gmail.com writes:

 Hi Giuseppe,

 * I've changed the configure.ac and src/Makefile.am.
 * I've added a ChangeLog entry.

lovely.  I am going to push it soon with some small adjustments.

Thanks for the great work.  Whenever we happen to be in the same place,
I'll buy you a beer :-)

Cheers,
Giuseppe



Re: [Bug-wget] WARC, new version

2011-11-04 Thread Gijs van Tulder

 lovely.  I am going to push it soon with some small adjustments.

That's good to hear.

There's one other small adjustment that you may want to make, see the 
attached patch. One of the WARC functions uses the basename function, 
which causes problems on OS X. Including libgen.h and strdup-ing the 
output of basename seems to solve this problem.


Thanks,

Gijs


On 04-11-11 22:27, Giuseppe Scrivano wrote:

 Gijs van Tulder gvtul...@gmail.com writes:

 Hi Giuseppe,

 * I've changed the configure.ac and src/Makefile.am.
 * I've added a ChangeLog entry.

 lovely.  I am going to push it soon with some small adjustments.

 Thanks for the great work.  Whenever we happen to be in the same place,
 I'll buy you a beer :-)

 Cheers,
 Giuseppe


--- a/src/warc.c	2011-11-04 17:41:11.383704054 +0100
+++ b/src/warc.c	2011-11-04 23:06:28.693712714 +0100
@@ -19,6 +19,10 @@
 #include <uuid/uuid.h>
 #endif
 
+#ifndef WINDOWS
+#include <libgen.h>
+#endif
+
 #include "warc.h"
 
 extern char *version_string;
@@ -605,7 +609,7 @@
 
   char *filename_copy, *filename_basename;
   filename_copy = strdup (filename);
-  filename_basename = basename (filename_copy);
+  filename_basename = strdup (basename (filename_copy));
 
   warc_write_start_record ();
   warc_write_header ("WARC-Type", "warcinfo");
@@ -619,6 +623,7 @@
   if (warc_tmp == NULL)
     {
       free (filename_copy);
+      free (filename_basename);
       return false;
     }
 
@@ -646,6 +651,7 @@
     }
 
   free (filename_copy);
+  free (filename_basename);
   fclose (warc_tmp);
   return warc_write_ok;
 }


Re: [Bug-wget] WARC, new version

2011-10-30 Thread Giuseppe Scrivano
Gijs van Tulder gvtul...@gmail.com writes:

 === modified file 'bootstrap.conf'
 --- bootstrap.conf	2011-08-11 12:23:39 +0000
 +++ bootstrap.conf	2011-10-21 19:24:18 +0000
 @@ -28,6 +28,7 @@
  accept
  alloca
  announce-gen
 +base32
  bind
  c-ctype
  clock-time
 @@ -49,6 +50,7 @@
  mbtowc
  mkdir
  crypto/md5
 +crypto/sha1
  pipe
  quote
  quotearg
 @@ -63,6 +65,7 @@
  stdbool
  strcasestr
  strerror_r-posix
 +tmpdir
  unlocked-io
  update-copyright
  vasprintf

 === modified file 'configure.ac'
 --- configure.ac	2011-09-04 12:19:12 +0000
 +++ configure.ac	2011-10-23 21:21:49 +0000
 @@ -511,7 +511,22 @@
fi
  fi
  
 -
 +# Warc
 +AC_CHECK_HEADER(uuid/uuid.h, UUID_FOUND=yes, UUID_FOUND=no)
 +if test x$UUID_FOUND = xno; then
 +  AC_MSG_ERROR([libuuid is required])
 +fi
 +
 +AC_CHECK_LIB(uuid, uuid_generate, UUID_FOUND=yes, UUID_FOUND=no)
 +if test x$UUID_FOUND = xno; then
 +  AC_MSG_ERROR([libuuid is required])
 +fi
 +LIBUUID=-luuid
 +AC_SUBST(LIBUUID)
 +LDFLAGS="${LDFLAGS} -L$libuuid/lib"
 +CPPFLAGS="${CPPFLAGS} -I$libuuid/include"

I think we shouldn't change the value of LDFLAGS and CPPFLAGS as they
are user variables.  Also, where is $libuuid defined?  We can just drop
these lines.



   if (hs->res >= 0)
     CLOSE_FINISH (sock);
   else
 -    {
 -      if (hs->res < 0)
 -        hs->rderrmsg = xstrdup (fd_errstr (sock));
 -      CLOSE_INVALIDATE (sock);
 -    }
 +    CLOSE_INVALIDATE (sock);

Why?


The rest seems OK; if you also provide a ChangeLog entry, I can proceed
to merge it.

Thanks,
Giuseppe



Re: [Bug-wget] WARC, new version

2011-10-30 Thread David H. Lipman
From: Giuseppe Scrivano gscriv...@gnu.org
I have seen WARC mentioned but have not seen a definition.

What is WARC ?
What is WARC used for ?
Windows or 'nix ?
What are its benefits, etc ?





-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp 






Re: [Bug-wget] WARC, new version

2011-10-30 Thread Gijs van Tulder

Hi David,

David H. Lipman wrote:

I have seen WARC mentioned but have not seen a definition.


WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
resources. It is used for making archives of web sites. The Internet
Archive, for example, uses it as the file format for their Wayback
Machine and Heritrix crawler.


The nice thing about WARC is that it lets you store all information 
about your web crawl: the files you download, of course, but also things 
like the HTTP request and response headers, information about redirects 
and error pages. WARC also provides a place to keep the related 
metadata. It is, in short, a way to store everything, in a standardized 
file format.
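
To give a rough idea of what a single record looks like on disk, here is a minimal sketch that formats one uncompressed WARC record: a version line, a few named header fields, a blank line, the content block, and two CRLFs to end the record. It is an illustration only, not the code from the patch; the function name format_warc_record and its parameters are made up for this example, and it assumes the content block is a NUL-terminated string (real records may hold binary data).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format one WARC record into BUF.
   Returns the number of bytes written (excluding the final NUL),
   or -1 if BUF is too small.  BLOCK must be a C string here.  */
int
format_warc_record (char *buf, size_t size, const char *type,
                    const char *record_id, const char *date,
                    const char *content_type, const char *block)
{
  int n = snprintf (buf, size,
                    "WARC/1.0\r\n"
                    "WARC-Type: %s\r\n"
                    "WARC-Record-ID: <%s>\r\n"
                    "WARC-Date: %s\r\n"
                    "Content-Type: %s\r\n"
                    "Content-Length: %zu\r\n"
                    "\r\n"          /* end of the record header */
                    "%s"            /* the content block */
                    "\r\n\r\n",     /* end of the record */
                    type, record_id, date, content_type,
                    strlen (block), block);

  if (n < 0 || (size_t) n >= size)
    return -1;
  return n;
}
```

A warcinfo record, for example, would use the type "warcinfo" and a block of "name: value" lines describing the crawl; request and response records carry the raw HTTP messages as their blocks.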


Adding WARC to wget means that you'll be able to do things like

  wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

which will produce (next to the normal wget download) a file named
'gnu.warc.gz' that contains every HTTP request and every HTTP response
that wget made. This is an 'archival grade' copy of the mirrored site.


Once you have the WARC file, you could store it in your archive, extract
files, or run your own local Wayback Machine [2, 3].


wget is already a very useful tool to make a quick copy of a website;
adding WARC support helps to make the copy as complete as possible.


Maybe that answers some of your questions?

Regards,

Gijs


[1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
[2] http://archive-access.sourceforge.net/projects/wayback/
[3] http://netpreserve.org/software/downloads.php



Re: [Bug-wget] WARC, new version

2011-10-30 Thread David H. Lipman
From: Gijs van Tulder gvtul...@gmail.com

 Hi David,

 David H. Lipman wrote:
 I have seen WARC mentioned but have not seen a definition.

 WARC (Web ARChive, ISO 28500:2009) [1] is a file format for storing web
 resources. It is used for making archives of web sites. The Internet
 Archive, for example, uses it as the file format for their Wayback
 Machine and Heritrix crawler.

 The nice thing about WARC is that it lets you store all information
 about your web crawl: the files you download, of course, but also things
 like the HTTP request and response headers, information about redirects
 and error pages. WARC also provides a place to keep the related
 metadata. It is, in short, a way to store everything, in a standardized
 file format.

 Adding WARC to wget means that you'll be able to do things like

   wget --mirror http://www.gnu.org/s/wget/ --warc-file=gnu

 which will produce (next to the normal wget download) a file named
 'gnu.warc.gz' that contains every HTTP request and every HTTP response
 that wget made. This is an 'archival grade' copy of the mirrored site.

 Once you have the WARC file, you could store it in your archive, extract
 files, or run your own local Wayback Machine [2, 3].

 wget is already a very useful tool to make a quick copy of a website;
 adding WARC support helps to make the copy as complete as possible.

 Maybe that answers some of your questions?

 Regards,

 Gijs


 [1] http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
 [2] http://archive-access.sourceforge.net/projects/wayback/
 [3] http://netpreserve.org/software/downloads.php



It answers all the questions, and now I understand.

*Thank You Gijs !*

-- 
Dave
Multi-AV Scanning Tool - http://multi-av.thespykiller.co.uk
http://www.pctipp.ch/downloads/dl/35905.asp 






Re: [Bug-wget] WARC, new version

2011-10-23 Thread Giuseppe Scrivano
Gijs van Tulder gvtul...@gmail.com writes:

 Hi all,

 Based on the comments by Giuseppe and Ángel I've revised the
 implementation of the wget WARC extension. I've attached a patch.

 1. It's no longer based on the warctools library. Instead, I've
 written a couple of new WARC-writing functions, using zlib for the
 gzip compression. The new implementation is much smaller.

 2. I extracted a small part of the gethttp method in http.c and moved
 it to a new function, read_response_body, which is responsible for
 downloading the response body and writing it to a file.

 The WARC extension needs to save the response in multiple cases: when
 the response is successful, but also when the response is a redirect,
 401 unauthorized or an error. Moving the response-saving to a separate
 method makes it possible to reuse this part for all four situations.
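
The four situations described above can be sketched as a small classification step followed by one shared decision about reading the body. This is only an illustration of the idea under simplifying assumptions, not the code in http.c: the names response_kind, classify_status and should_read_body are hypothetical, and the status ranges are coarser than what wget actually distinguishes.

```c
#include <stdbool.h>

/* Hypothetical categories matching the four cases in the text.  */
enum response_kind
{
  RESPONSE_SUCCESS,
  RESPONSE_REDIRECT,
  RESPONSE_UNAUTHORIZED,
  RESPONSE_ERROR
};

/* Map an HTTP status code to one of the four cases (simplified).  */
enum response_kind
classify_status (int status)
{
  if (status >= 200 && status < 300)
    return RESPONSE_SUCCESS;
  if (status >= 300 && status < 400)
    return RESPONSE_REDIRECT;
  if (status == 401)
    return RESPONSE_UNAUTHORIZED;
  return RESPONSE_ERROR;
}

/* In this sketch, a successful response body is always read, while
   the other three kinds are read only when WARC output is enabled,
   so that one body-reading function can serve all four cases.  */
bool
should_read_body (enum response_kind kind, bool warc_enabled)
{
  return warc_enabled || kind == RESPONSE_SUCCESS;
}
```

With the decision isolated like this, each of the four code paths can call the same body-reading function instead of duplicating the download loop.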

 Any thoughts?

WOW great work!  It is much better now.

I wonder if it is possible to remove the dependency on libuuid, maybe by
providing a replacement for uuid_generate and uuid_unparse when libuuid
is not found?  Even a simple implementation based on rand?
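
As a rough illustration of what such a fallback could look like, here is a minimal sketch that fills 16 bytes from rand, sets the version and variant bits of a random (version 4) UUID, and formats the result as the usual 36-character string. It is an assumption about one possible approach, not the code that ended up in wget; the names fallback_uuid_generate and fallback_uuid_unparse are hypothetical, and rand is not a strong source of randomness, so collisions are more likely than with libuuid.

```c
#include <stdlib.h>

/* Hypothetical fallback: fill OUT with a random (version 4) UUID.
   Uses rand, so the caller should seed it with srand beforehand.  */
void
fallback_uuid_generate (unsigned char out[16])
{
  int i;

  for (i = 0; i < 16; i++)
    out[i] = (unsigned char) (rand () & 0xff);

  out[6] = (unsigned char) ((out[6] & 0x0f) | 0x40);  /* version 4 */
  out[8] = (unsigned char) ((out[8] & 0x3f) | 0x80);  /* RFC 4122 variant */
}

/* Hypothetical fallback: write IN as a lowercase 8-4-4-4-12 hex
   string into OUT, which must hold 37 bytes including the NUL.  */
void
fallback_uuid_unparse (const unsigned char in[16], char out[37])
{
  static const char hex[] = "0123456789abcdef";
  int i, pos = 0;

  for (i = 0; i < 16; i++)
    {
      /* Dashes go before bytes 4, 6, 8 and 10.  */
      if (i == 4 || i == 6 || i == 8 || i == 10)
        out[pos++] = '-';
      out[pos++] = hex[in[i] >> 4];
      out[pos++] = hex[in[i] & 0x0f];
    }
  out[pos] = '\0';
}
```

A configure check could then select these functions only when uuid/uuid.h is missing, keeping libuuid as the preferred implementation.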

Besides that, only very small adjustments need to be made to the code
before it can be included in wget, like keeping lines no longer than
80 characters or using "foo *bar" instead of "foo * bar"; in any case
these are not important and I can go through them before committing
your changes.

Thanks,
Giuseppe