MoD wrote:
Julien,

I did try with 2048M per task child,
but no luck: I still have two reduces that don't go through.

Is it related to the number of reduces?
On this cluster I have 4 servers:
- dual Xeon, dual core (8 cores)
- 8 GB RAM
- 4 disks

I set mapred.reduce.tasks and mapred.map.tasks to 16,
because I have 4 servers with 4 disks each. (What do you think?)
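
For reference, this is roughly what I have in hadoop-site.xml (a sketch from
memory; I'm assuming the standard Hadoop property names, tell me if I should
be using different ones):

  <property>
    <name>mapred.map.tasks</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>16</value>
  </property>
  <property>
    <!-- heap given to each map/reduce task child JVM (the 2048M above) -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>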

Maybe, if this job is too big for my cluster, would adding reduce tasks
subdivide the problem into smaller reduces?
Actually I think not, because I guess all input keys for the same domain go to the same reduce?

So are my two last reduce tasks the biggest domains in my DB?

This is likely caused by a large number of inlinks for certain URLs - the updatedb reduce collects this list in memory, which sometimes leads to memory exhaustion. Please try limiting the maximum number of inlinks per URL (see nutch-default.xml for details).
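
For example, something along these lines in nutch-site.xml (a sketch only; I'm
assuming the property is db.max.inlinks - check nutch-default.xml in your
version for the exact name and its default):

  <property>
    <name>db.max.inlinks</name>
    <!-- keep at most this many inlinks per URL, so the reduce does not
         build an unbounded in-memory list for very popular URLs -->
    <value>5000</value>
  </property>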


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
