Rod Taylor wrote:
I see. Is there any way to speed up this phase? It seems to be taking as
long to run the sort phase as it did to download the data.

It would appear that nearly 30% of the time for the nutch fetch segment
is spent doing the sorts, so I'm well off the 20% overhead number you
seem to be able to achieve for a full cycle.

There are 5 machines (4 CPUs each) running Red Hat, each with 8 tasks and
a load average of about 5. Context switches are low (under 1500/second).
There is virtually no I/O (the boxes have plenty of RAM), but the kernel
is doing a lot of work: 50% of CPU time is in system (unsure on what, as
I'm not familiar with the Linux DTrace-type tools).

Sorting is usually i/o bound on mapred.local.dir. When eight tasks are using the same device this could become a bottleneck. Use iostat or sar to view disk i/o statistics.
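
If iostat or sar is not installed, a rough equivalent of the per-device utilization figure can be computed from two samples of /proc/diskstats. The following is only a sketch: it assumes a Linux kernel that exposes /proc/diskstats, and the function names are made up for illustration.

```python
import time


def parse_diskstats(text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy = {}
    for line in text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then at least 11 counters.
        # The 10th counter (index 12) is time spent doing I/O, in ms.
        if len(fields) < 14:
            continue
        busy[fields[2]] = int(fields[12])
    return busy


def utilization(before, after, interval_ms):
    """Fraction of the interval each device was busy (0.0 to about 1.0)."""
    return {
        dev: (after[dev] - before[dev]) / interval_ms
        for dev in after
        if dev in before
    }


def sample_utilization(interval_s=1.0, path="/proc/diskstats"):
    """Read the counters twice, interval_s apart, and return utilization."""
    with open(path) as f:
        before = parse_diskstats(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_diskstats(f.read())
    return utilization(before, after, interval_s * 1000.0)
```

A device that stays close to 1.0 while the sort phase runs would be consistent with the sort being bound on that device.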

My plan is to permit one to specify a list of directories for mapred.local.dir and have the sorting (and everything else) select randomly among these for temporary local files. That way all devices can be used in parallel.
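
Roughly, the selection could look like the sketch below. This is Python for illustration only, not the actual implementation; the comma-separated list format and the names parse_local_dirs and create_temp_file are assumptions.

```python
import os
import random
import tempfile


def parse_local_dirs(value):
    """Split a comma-separated directory list, dropping empty entries."""
    return [d.strip() for d in value.split(",") if d.strip()]


def create_temp_file(local_dirs, prefix="sort-", rng=random):
    """Create a temporary file in a randomly chosen directory.

    Choosing at random spreads temporary files, and so the I/O,
    across every configured device. Returns the new file's path.
    """
    if not local_dirs:
        raise ValueError("no local directories configured")
    chosen = rng.choice(local_dirs)
    os.makedirs(chosen, exist_ok=True)
    fd, path = tempfile.mkstemp(prefix=prefix, dir=chosen)
    os.close(fd)
    return path
```

With a value such as "/disk1/mapred/local,/disk2/mapred/local" (hypothetical paths), repeated calls to create_temp_file would land on each disk about equally often.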

As a workaround you could try starting eight tasktrackers, each configured with a different device for mapred.local.dir. Yes, that's a pain, but it would give us an idea of whether my analysis is correct.
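
To take some of the pain out of it, the per-tasktracker configuration could be generated. The sketch below only produces the mapred.local.dir property in the usual Hadoop configuration XML layout; the mount points are hypothetical, and any other settings each tasktracker needs (and how each one is started) are left out.

```python
from xml.sax.saxutils import escape


def tasktracker_config(local_dir):
    """Return a configuration document that sets mapred.local.dir."""
    return (
        '<?xml version="1.0"?>\n'
        "<configuration>\n"
        "  <property>\n"
        "    <name>mapred.local.dir</name>\n"
        f"    <value>{escape(local_dir)}</value>\n"
        "  </property>\n"
        "</configuration>\n"
    )


def configs_for_devices(mount_points, subdir="mapred/local"):
    """Map each mount point to the config text for one tasktracker."""
    return {
        mount: tasktracker_config(mount.rstrip("/") + "/" + subdir)
        for mount in mount_points
    }


# Hypothetical mount points, one per physical device.
MOUNT_POINTS = ["/disk%d" % i for i in range(1, 9)]
CONFIGS = configs_for_devices(MOUNT_POINTS)
```

Each entry in CONFIGS would then be written to the configuration directory of one of the eight tasktrackers.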

Doug
