Hi,

I am trying to index about 2 million pages that I crawled with Nutch. When I run the bin/nutch invertlinks and index commands, my reduce tasks often fail with the following message:

Task task_200711171111_0003_r_000000_1 failed to report status for 600 seconds. Killing!

(The reported time ranges from 600 to about 605 seconds.) This happens while the reduce tasks are copying their input data. Is there a way around this timeout?

I've also noticed that Nutch always uses only one reducer for these tasks, regardless of the size of the DB. Is this by design, or is there a way to configure the number of reducers so that the jobs finish faster? The jobs take about 2 hours, most of which is spent running the sole reducer.

Thanks,
Matei
