Really need more info, but here are some things to think about:
1) You may be maxing out your bandwidth. Grep the log file for the
number of timed-out exceptions.
The number of timeout exceptions is comparatively low, maybe on the
order of one thousand over the 4 days of log files, far lower than
the more than 1 million missing pages. The number of threads I'm
using is the same as I used on a Python crawler I wrote to do this
the last time it was needed: only 12.
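For what it's worth, a count along these lines is what I'm basing that
on; the log path is assumed (the default logs/hadoop.log) and the exact
exception text may differ depending on the protocol plugin:

  grep -i "timed out" logs/hadoop.log* | wc -l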
2) If you are using a high number of threads and a single DNS server,
or multiple fetcher machines sharing a single DNS server instead of a
local caching DNS on each machine, you may be DoSing your DNS server,
which slows everything down and can cause UnknownHost errors.
Well, it's never throwing an exception. Also, all the URLs are on the
same host, though I'm not sure what, if anything, would cache that.
The logs in general are eerily spotless considering it's failing more
than half the time.
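For anyone following along, something like this (log path assumed)
should show whether DNS failures or any other warnings are hiding in
there:

  # count DNS failures, then warnings/errors of any kind
  grep -c "UnknownHostException" logs/hadoop.log
  grep -cE "ERROR|WARN" logs/hadoop.log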
3) You may be hitting a bad set. When doing whole-web crawling we
routinely see bad-page rates of 50-75% on the lowest-scoring pages of
a 100M page crawl, while high-scoring pages typically yield a 5-15%
error rate.
The set is a fairly recent list of all non-redirect Wikipedia pages.
I'm very familiar with the rate of Wikipedia page deletions, and it's
nowhere near the numbers I'm seeing.
Whatever the problem is, it's definitely producing no error messages
in the logs. I think setting the threads to 8 might solve my problem,
but given that there are no error messages, if I were doing a crawl of
greater than depth 1, I wouldn't trust the completeness of the crawl.
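Since every one of these URLs is on the same host, the per-host
politeness settings seem like the obvious suspects; my understanding
(which may be off) is that with too many threads on one busy host the
fetcher keeps backing off and can eventually give up on pages for that
round, which would fit the fewer-pages-but-no-errors pattern. A sketch
of how I'd check what's in effect, assuming the stock conf/ layout:

  # property names are the standard ones from nutch-default.xml
  grep -E -A 2 "fetcher.threads.per.host|fetcher.server.delay|http.max.delays|http.timeout" \
      conf/nutch-site.xml conf/nutch-default.xml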
Dennis
Tristan Buckner wrote:
Lowering the number of threads from 12 to 8 seems to improve the
success rate. Is this expected behavior? It seems like, if that were
the problem, I should be getting fetch errors.
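In case it matters, the thread count can be set either on the fetch
command or via fetcher.threads.fetch in conf/nutch-site.xml; if I
remember the usage right, it is something like:

  # <segment> is a placeholder for the actual segment directory
  bin/nutch fetch crawl/segments/<segment> -threads 8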
Tristan Buckner
Metaweb Technologies Inc.
On Jul 24, 2008, at 12:15 PM, Tristan Buckner wrote:
Hello fellow Nutch users,
I've been trying to crawl a static list of about 2.3 million URLs. To
do this I've injected my list of URLs, set db.update.additions.allowed
to false, and set the crawl depth to 1. I ran this for about 4 days
(which was around the expected time for a complete crawl), and when it
completed, grepping the log files showed about 2.3 million fetches
with very few timeouts and no other errors reported. However, the
on-disk size looked way too small, and after iterating through all the
pages, it turned out only about 1.1 million had actually been
downloaded.
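For anyone who wants to check the same thing, the per-segment
generated/fetched/parsed counts can be pulled with something like the
following, assuming the standard crawl/segments layout:

  bin/nutch readseg -list -dir crawl/segments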
The size of this crawl made for a pretty intractable debug cycle, so I
started testing on a 20k subset. The first run yielded about 6k of the
20k. Updating, generating, and fetching again yielded another 4k, some
of which had been fetched previously and some of which had not.
Checking the crawldb, the missing URLs are marked as unfetched, but I
have the impression that status just comes from running update against
the fetched segments, so pages that were quietly discarded without
ever being written out would look exactly the same. I also set
fetching to verbose, but there is no new info in the logs.
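For reference, the status breakdown in the crawldb can be checked with
readdb; a sketch, assuming the crawl lives under crawl/:

  # prints counts by status (db_fetched, db_unfetched, db_gone, ...)
  bin/nutch readdb crawl/crawldb -stats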
Any ideas?
Tristan Buckner
Metaweb Technologies Inc.