I've got the same problem. It takes ages to generate a list of URLs to fetch
with a crawl DB of about 2M URLs.
Do you mean that it will be faster if we configure generation by IP?
I've read nutch-default.xml; it says we have to be careful because
that option can generate a lot of DNS requests.
Does that mean I have to configure a local DNS cache on my server?
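For reference, I think these are the properties to override in nutch-site.xml. The
names and defaults below are what I remember from the 0.9 nutch-default.xml, so please
double-check them against your own copy before relying on this sketch (the comments
are mine, not the stock descriptions):

  <property>
    <name>generate.max.per.host</name>
    <value>-1</value>
    <!-- Cap on URLs generated per host in a single fetchlist; -1 means no limit. -->
  </property>

  <property>
    <name>generate.max.per.host.by.ip</name>
    <value>false</value>
    <!-- If true, hosts are counted by resolved IP rather than by host name.
         This is the setting whose description warns about the extra DNS
         lookups, which is why a local caching resolver comes up. -->
  </property>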
On 7/25/07, Luca Rondanini <[EMAIL PROTECTED]> wrote:
Hi all,
The generate step of my crawl process is taking more than 2 hours... is that normal?
Are you partitioning URLs by IP or by host?
This is my stats report:
CrawlDb statistics start: crawl/crawldb
Statistics for CrawlDb: crawl/crawldb
TOTAL urls: 586860
retry 0: 578159
retry 1: 1983
retry 2: 2017
retry 3: 4701
min score: 0.0
avg score: 0.0
max score: 1.0
status 1 (db_unfetched): 164849
status 2 (db_fetched): 417306
status 3 (db_gone): 4701
status 5 (db_redir_perm): 4
CrawlDb statistics: done
Luca Rondanini
--
Doğacan Güney