Really need more info, but here are some things to think about:
1) You may be maxing out your bandwidth. Grep the log file for the
number of timed-out exceptions.
The number of timeout exceptions is comparatively low, maybe on the
order of one thousand over the 4 days of log files, far lower than
the more than 1 million missing pages. The number of threads I'm
using is the same as I used on a Python crawler I wrote to do this
the last time it was needed: only 12.
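For what it's worth, a count along these lines is what I'm basing that
on; the log path is assumed (the default logs/hadoop.log) and the exact
exception text may differ depending on the protocol plugin:

  grep -i "timed out" logs/hadoop.log* | wc -l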
2) If you are using a high number of threads and a single DNS server,
or multiple fetcher machines sharing a single DNS server instead of a
local caching DNS on each machine, you may be DoSing your DNS server,
which slows everything down and can cause UnknownHost errors.
Well, it's never throwing an exception. Also, all the URLs are on the
same host, though I'm not sure what, if anything, would cache that.
The logs in general are eerily spotless considering it's failing more
than half the time.
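For anyone following along, something like this (log path assumed)
should show whether DNS failures or any other warnings are hiding in
there:

  # count DNS failures, then warnings/errors of any kind
  grep -c "UnknownHostException" logs/hadoop.log
  grep -cE "ERROR|WARN" logs/hadoop.log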
3) You may be hitting a bad set. When doing whole-web crawling we
routinely see bad-page rates of 50-75% on the lowest-scoring pages of
a 100M page crawl, while high-scoring pages typically yield a 5-15%
error rate.
The set is a fairly recent list of all non-redirect Wikipedia pages.
I'm very familiar with the rate of Wikipedia page deletions, and it's
nowhere near the numbers I'm seeing.
Whatever the problem is, it's definitely producing no error messages
in the logs. I think setting the threads to 8 might solve my problem,
but given that there are no error messages, if I were doing a crawl of
greater than depth 1, I wouldn't trust the completeness of the crawl.
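Since every one of these URLs is on the same host, the per-host
politeness settings seem like the obvious suspects; my understanding
(which may be off) is that with too many threads on one busy host the
fetcher keeps backing off and can eventually give up on pages for that
round, which would fit the fewer-pages-but-no-errors pattern. A sketch
of how I'd check what's in effect, assuming the stock conf/ layout:

  # property names are the standard ones from nutch-default.xml
  grep -E -A 2 "fetcher.threads.per.host|fetcher.server.delay|http.max.delays|http.timeout" \
      conf/nutch-site.xml conf/nutch-default.xml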
Dennis
Tristan Buckner wrote:
Lowering the number of threads from 12 to 8 seems to improve the
success rate. Is this expected behavior? It seems like, if that were
the problem, I should be getting fetch errors.
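In case it matters, the thread count can be set either on the fetch
command or via fetcher.threads.fetch in conf/nutch-site.xml; if I
remember the usage right, it is something like:

  # <segment> is a placeholder for the actual segment directory
  bin/nutch fetch crawl/segments/<segment> -threads 8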
Tristan Buckner
Metaweb Technologies Inc.
On Jul 24, 2008, at 12:15 PM, Tristan Buckner wrote:
Hello fellow Nutch users,
I've been trying to crawl a static list of about 2.3 million URLs. To
do this I've injected my list of URLs, set db.update.additions.allowed
to false, and set the crawl depth to 1. I ran this for about 4 days
(which was around the expected time for a complete crawl), and when it
completed, grepping the log files showed about 2.3 million fetches
with very few timeouts and no other errors reported. However, the
on-disk size looked way too small, and after iterating through all the
pages, it turned out only about 1.1 million had actually been
downloaded.
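For anyone who wants to check the same thing, the per-segment
generated/fetched/parsed counts can be pulled with something like the
following, assuming the standard crawl/segments layout:

  bin/nutch readseg -list -dir crawl/segments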
The size of this crawl made for a pretty intractable debug cycle, so I
started testing on a 20k subset. The first run yielded about 6k of the
20k. Updating, generating, and fetching again yielded another 4k, some
of which had been fetched previously and some of which had not.
Checking the crawldb, the missing URLs are marked as unfetched, but I
have the impression that status just comes from running update against
the fetched segments, so pages that were quietly discarded without
ever being written out would look exactly the same. I also set
fetching to verbose, but there is no new info in the logs.
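For reference, the status breakdown in the crawldb can be checked with
readdb; a sketch, assuming the crawl lives under crawl/:

  # prints counts by status (db_fetched, db_unfetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats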
Any ideas?
Tristan Buckner
Metaweb Technologies Inc.