Hi Ken,

>>> 4. Any idea whether 4 hours is a reasonable amount of time for this
>>> test? It seemed long to me, given that I was starting with a single
>>> URL as the seed.
>>
>> How many crawl passes did you do?
>
> Three deep, as in: bin/nutch crawl seeds -depth 3
>
> This was the same as Doug described in his post here:
>
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200509.mbox/[EMAIL PROTECTED]
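A quick note on that command: besides -depth, the crawl tool also accepts -dir, -topN and -threads, which directly affect run time. A minimal sketch (flag names as I recall them; the values are only illustrative):

    bin/nutch crawl seeds -dir crawl -depth 3 -topN 50 -threads 10

Without a -topN cap, each pass fetches every URL generated by the previous one, so even a single-seed crawl can grow quickly by the third round.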
I assume the time it takes depends on your hardware, bandwidth, how many URLs are being fetched, and your MapReduce settings. Four hours does seem a bit long when starting from a single URL, though.

Are you using 2 or 3 slave machines? What values are you using for "fetcher.threads.fetch", "mapred.map.tasks" and "mapred.reduce.tasks"? When you do a "nutch readdb crawldb -stats", how many DB_unfetched and DB_fetched do you have?
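If you want to experiment with those settings, here is a rough sketch of what the overrides could look like. The property names are the ones mentioned above; the values, and putting them in conf/nutch-site.xml, are just assumptions to adapt to your own setup:

    <!-- conf/nutch-site.xml: illustrative values only -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
      <description>Number of fetcher threads (placeholder value; tune for your bandwidth).</description>
    </property>
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
      <description>Placeholder value; tune for the number of slave machines.</description>
    </property>
    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
      <description>Placeholder value; tune for the number of slave machines.</description>
    </property>

Then "bin/nutch readdb crawldb -stats" shows DB_fetched vs. DB_unfetched, which tells you whether the 4 hours went into fetching a lot of pages or into overhead.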
--Flo