Hello fellow Nutch users,
I've been trying to crawl a static list of about 2.3 million URLs. To
do this I injected my list of URLs, set db.update.additions.allowed
to false, and set the crawl depth to 1. I ran this for about 4 days
(which was around the expected time for a complete crawl), and when it
completed, grepping the log files showed about 2.3 million fetches
with very few timeouts and no other errors reported. However, the
on-disk size looked way too small, and after iterating through all the
pages it turned out only about 1.1 million had actually been
downloaded.
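
For reference, this is roughly the setup; the paths and segment name
below are simplified placeholders rather than my exact ones.

In conf/nutch-site.xml:

    <property>
      <name>db.update.additions.allowed</name>
      <value>false</value>
    </property>

And the cycle, more or less:

    bin/nutch inject crawl/crawldb urls/
    bin/nutch generate crawl/crawldb crawl/segments
    s=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $s
    bin/nutch updatedb crawl/crawldb $s

The ~2.3 million figure came from something like the following,
assuming the fetcher still logs one "fetching <url>" line per URL:

    grep -c 'fetching ' logs/hadoop.log
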
The size of this crawl made for a pretty intractable debug cycle, so
I started testing on a 20k subset. The first run yielded only about
6k of the 20k. Updating, generating, and fetching again yielded
another 4k, some of which had already been fetched and some of which
had not.
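
For the counts above I've been doing roughly the following against
each segment (segment name is a placeholder, and I may be misreading
the readseg output, so corrections are welcome):

    bin/nutch readseg -list crawl/segments/20070101000000
    bin/nutch readseg -dump crawl/segments/20070101000000 seg_dump
    grep -c 'URL::' seg_dump/dump
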
Checking the crawldb, the missing URLs are marked as unfetched, but
my impression is that this status simply reflects running updatedb on
the fetched segments, so URLs that were silently dropped without ever
being written to a segment would look exactly the same. I also set
the fetcher to verbose logging, but there is no new information in
the logs.
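
In case it matters, this is how I'm checking the crawldb (the
db_unfetched / db_fetched status lines are what I'm going by):

    bin/nutch readdb crawl/crawldb -stats
    bin/nutch readdb crawl/crawldb -dump crawldb_dump

The second command is just to eyeball individual entries for the
missing URLs.
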
Any ideas?
Tristan Buckner
Metaweb Technologies Inc.