Re: non-obvious incomplete crawls

Tristan Buckner Thu, 24 Jul 2008 16:44:20 -0700

Lowering the number of threads to 8 from 12 seems to improve the success rate. Is this expected behavior? It seems like if that is the problem I should be getting fetch errors.


Tristan Buckner
Metaweb Technologies Inc.


On Jul 24, 2008, at 12:15 PM, Tristan Buckner wrote:

Hello fellow nutch users,
I've been trying to crawl a static list of about 2.3 million urls. To do this I've injected my list of urls, set db.update.additions.allowed to false, and set the crawl depth to 1. I ran this for about 4 days (which was around the expected time for a complete crawl) and when it completed, grepping the log files yielded about 2.3 million fetches with very few timeouts and no other errors reported. However, the on-disk size looked way too small, and after iterating through all the pages it turned out only about 1.1 million had been downloaded.
The size of this crawl made for a pretty intractable debug cycle, so I started testing it on a 20k subset. The first run yielded about 6k of the 20k. Updating, generating, and fetching again yielded another 4k, of which some were previously fetched and some were not. Checking the crawldb, the missing files are marked as unfetched, but I have the impression this is based on running update on the fetched segments, so if they were just discarded without being written out it would look like this. I also set fetching to verbose, but there is no new info in the logs.
Any ideas?

Tristan Buckner
Metaweb Technologies Inc.
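
One way to pin down which pages went missing in the 20k test is to diff the injected url list against the urls that actually ended up on disk. The following is only a rough sketch of that comparison, not anything Nutch provides: it assumes both lists are already available as one url per line (how the fetched list gets extracted from the segments is not shown), and the function names are made up for illustration.

```python
from typing import Iterable, List, Set


def normalize(url: str) -> str:
    """Trim surrounding whitespace only.

    Assumption: no other url normalization is applied here, so urls that
    differ in case or trailing slash are treated as different urls.
    """
    return url.strip()


def missing_urls(seed_urls: Iterable[str], fetched_urls: Iterable[str]) -> List[str]:
    """Return seed urls that never show up in the fetched list.

    Order follows the seed list; blank lines and repeated seeds are skipped.
    """
    fetched: Set[str] = {normalize(u) for u in fetched_urls if u.strip()}
    seen: Set[str] = set()
    missing: List[str] = []
    for raw in seed_urls:
        url = normalize(raw)
        if not url or url in seen:
            continue
        seen.add(url)
        if url not in fetched:
            missing.append(url)
    return missing


def load_lines(path: str) -> List[str]:
    """Read a plain text file with one url per line (hypothetical input format)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().splitlines()
```

Running missing_urls(load_lines(...), load_lines(...)) after each generate/fetch/update round, on whatever two files hold the seed and fetched lists, would show whether the same urls keep dropping out or a different set goes missing each time.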
