Really need more info, but here are some things to think about:
1) You may be maxing out your bandwidth. Grep the log file for the
number of timed-out exceptions.
2) If you are using a high number of threads against a single DNS
server, or multiple fetcher machines sharing a single DNS server
instead of a local caching DNS on each machine, you may be DoS-ing
your DNS server, which slows everything down and can cause
UnknownHost errors.
3) You may be hitting a bad set. When doing whole-web crawling we
routinely see bad-page rates of 50-75% on the lowest-scoring pages of a
100M-page crawl, while high-scoring pages will typically yield a 5-15%
error rate.
Dennis
Tristan Buckner wrote:
Lowering the number of threads from 12 to 8 seems to improve the success
rate. Is this expected behavior? It seems like if that were the problem,
I should be getting fetch errors.
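For reference, the thread count being lowered here is controlled by the fetcher.threads.fetch property in nutch-site.xml; a sketch (the value shown is just the 8-thread setting discussed above):

```xml
<property>
  <name>fetcher.threads.fetch</name>
  <value>8</value>
  <description>The number of FetcherThreads the fetcher should use.</description>
</property>
```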
Tristan Buckner
Metaweb Technologies Inc.
On Jul 24, 2008, at 12:15 PM, Tristan Buckner wrote:
Hello fellow nutch users,
I've been trying to crawl a static list of about 2.3 million URLs. To
do this I've injected my list of URLs, set db.update.additions.allowed
to false, and set the crawl depth to 1. I ran this for about 4 days
(which was around the expected time for a complete crawl), and when
it completed, grepping the log files yielded about 2.3 million fetches
with very few timeouts and no other errors reported. However, the
on-disk size looked way too small, and after iterating through all the
pages, it turned out only about 1.1 million had been downloaded.
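For context, the update suppression described above is this property in nutch-site.xml (a sketch of the setting as I understand it; with it false, updatedb should not add newly discovered links to the crawldb):

```xml
<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If false, updatedb will not add newly discovered URLs
  to the crawldb, restricting the crawl to the injected seed list.</description>
</property>
```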
The size of this crawl made for a pretty intractable debug cycle, so I
started testing on a 20k subset. The first run yielded about 6k of the
20k. Updating, generating, and fetching again yielded another 4k, of
which some were previously fetched and some were not.
Checking the crawldb, the missing URLs are marked as unfetched, but I
have the impression this status is based on running update against the
fetched segments, so if pages were simply discarded without being
written out, it would look the same. I also set fetching to verbose,
but there is no new info in the logs.
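A sketch of the crawldb check described above, assuming a crawl directory named crawl/crawldb (adjust the path to your layout); readdb -stats summarizes the per-status counts, so you can see how many URLs ended up unfetched versus fetched after each update:

```shell
# Summarize CrawlDb status counts (fetched, unfetched, gone, ...).
bin/nutch readdb crawl/crawldb -stats
```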
Any ideas?
Tristan Buckner
Metaweb Technologies Inc.