Hi,

There's a bug in the warc_find_duplicate_cdx_record function. When you provide a file with CDX records, Wget can segfault if a record is not found in that file. In effect, deduplication currently works only if *every* new record can be found in the CDX index.

The segmentation fault is generated on these lines in src/warc.c:

  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
                       &rec_existing);
  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)

Contrary to what the code expects, hash_table_get_pair does not set rec_existing to NULL if no record is found, so the uninitialized pointer is dereferenced. Instead of checking for NULL, the function should check whether the return value of hash_table_get_pair is non-zero:

  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
                                   &key, &rec_existing);
  if (found && strcmp (rec_existing->url, url) == 0)

The attached patch makes this change. With it, deduplication also works when some records are missing from the CDX index.
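
To illustrate why the NULL check is unreliable, here is a minimal standalone sketch. It does not use Wget's hash table: lookup_pair, struct entry, struct record and find_duplicate are hypothetical stand-ins that only mimic the contract described above (non-zero return on success, out-parameters left untouched on a miss).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a CDX record; only the url field matters here. */
struct record
{
  const char *url;
};

/* Hypothetical table entry mapping a payload digest to a record. */
struct entry
{
  const char *key;
  struct record *value;
};

/* Assumed contract, mirroring the behaviour described in this mail:
   return non-zero and fill the out-parameters if KEY is found;
   return 0 and leave the out-parameters untouched otherwise.  */
static int
lookup_pair (const struct entry *table, size_t n, const char *key,
             const char **orig_key, struct record **value)
{
  size_t i;
  for (i = 0; i < n; i++)
    if (strcmp (table[i].key, key) == 0)
      {
        if (orig_key)
          *orig_key = table[i].key;
        if (value)
          *value = table[i].value;
        return 1;
      }
  return 0;
}

/* Safe pattern: decide on the return value, never on the out-parameter.
   rec_existing is only read when the lookup reported success.  */
static struct record *
find_duplicate (const struct entry *table, size_t n, const char *url,
                const char *digest)
{
  const char *key;
  struct record *rec_existing;  /* uninitialized, as in warc.c */
  int found = lookup_pair (table, n, digest, &key, &rec_existing);

  if (found && strcmp (rec_existing->url, url) == 0)
    return rec_existing;
  return NULL;
}
```

In this sketch, a miss leaves rec_existing holding whatever happened to be on the stack, so a "rec_existing != NULL" test would pass or fail arbitrarily; testing found first means the pointer is never read in that case.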

Regards,

Gijs
From 807b98d7d9289765c9f210336d2dbf294d663f99 Mon Sep 17 00:00:00 2001
From: Gijs van Tulder <gvtul...@gmail.com>
Date: Wed, 30 May 2012 23:00:04 +0200
Subject: [PATCH] warc: Fix segfault if CDX record is not found.

---
 src/ChangeLog |    4 ++++
 src/warc.c    |    6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/ChangeLog b/src/ChangeLog
index 7e16b17..9e74e47 100644
--- a/src/ChangeLog
+++ b/src/ChangeLog
@@ -1,3 +1,7 @@
+2012-05-30  Gijs van Tulder  <gvtul...@gmail.com>
+
+	* warc.c: Fix segfault if CDX record is not found.
+
 2011-05-26  Steven Schweda  <s...@antinode.info>
 	* connect.c [HAVE_SYS_SOCKET_H]: Include <sys/socket.h>.
 	[HAVE_SYS_SELECT_H]: Include <sys/select.h>.
diff --git a/src/warc.c b/src/warc.c
index 24751db..92a49ef 100644
--- a/src/warc.c
+++ b/src/warc.c
@@ -1001,10 +1001,10 @@ warc_find_duplicate_cdx_record (char *url, char *sha1_digest_payload)
 
   char *key;
   struct warc_cdx_record *rec_existing;
-  hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload, &key,
-                       &rec_existing);
+  int found = hash_table_get_pair (warc_cdx_dedup_table, sha1_digest_payload,
+                                   &key, &rec_existing);
 
-  if (rec_existing != NULL && strcmp (rec_existing->url, url) == 0)
+  if (found && strcmp (rec_existing->url, url) == 0)
     return rec_existing;
   else
     return NULL;
-- 
1.7.4.1
