Lowering the number of fetcher threads from 12 to 8 seems to improve
the success rate. Is this expected behavior? If the thread count were
the problem, I would expect to be getting fetch errors.
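For reference, a minimal sketch of the change I'm making, assuming the
standard fetcher.threads.fetch override in conf/nutch-site.xml:

  <!-- number of fetcher threads; was 12, 8 seems to fail less often -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>8</value>
  </property>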
Tristan Buckner
Metaweb Technologies Inc.
On Jul 24, 2008, at 12:15 PM, Tristan Buckner wrote:
Hello fellow Nutch users,
I've been trying to crawl a static list of about 2.3 million URLs.
To do this I've injected my list of URLs, set
db.update.additions.allowed to false, and set the crawl depth to 1.
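(For context, a sketch of the cycle as I understand it, with
illustrative paths; the single pass is the depth-1 part, and
db.update.additions.allowed=false in conf/nutch-site.xml keeps
updatedb from adding outlinks:)

  # inject the static list, then one generate/fetch/updatedb pass
  bin/nutch inject crawl/crawldb urls
  bin/nutch generate crawl/crawldb crawl/segments
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s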
I ran this for about 4 days (which was around the expected time for a
complete crawl), and when it completed, grepping the log files yielded
about 2.3 million fetches with very few timeouts and no other errors
reported. However, the on-disk size looked way too small, and after
iterating through all the pages, it turned out only about 1.1 million
had been downloaded.
The size of this crawl made for a pretty intractable debug cycle, so I
started testing on a 20k subset. The first run yielded about 6k of the
20k. Updating, generating, and fetching again yielded another 4k, some
of which had already been fetched and some of which had not. Checking
the crawldb, the missing URLs are marked as unfetched, but my
impression is that this status comes from running updatedb on the
fetched segments, so pages that were simply discarded without being
written out would look exactly the same. I also set the fetcher to
verbose logging, but there is no new information in the logs.
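In case it matters, this is roughly how I'm checking the counts
(assuming the standard readdb tool):

  # summary of fetched vs. unfetched counts in the crawldb
  bin/nutch readdb crawl/crawldb -stats
  # full dump to see which URLs are still marked unfetched
  bin/nutch readdb crawl/crawldb -dump crawldb_dump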
Any ideas?
Tristan Buckner
Metaweb Technologies Inc.