Re: mapred Sort Progress Reports

Doug Cutting Mon, 03 Oct 2005 14:11:41 -0700

Rod Taylor wrote:

I see. Is there any way to speed up this phase? It seems to be taking as
long to run the sort phase as it did to download the data.


It would appear that nearly 30% of the time for the nutch fetch segment
is spent doing the sorts, so I'm well off the 20% overhead number you
seem to be able to achieve for a full cycle.

5 machines (4 CPUs each), each with 8 tasks; the load average is about 5 and
they run Redhat. Context switches are low (under 1500/second). There is
virtually no IO (boxes have plenty of ram) but the kernel is doing a
bunch of work as 50% of CPU time is in system (unsure what, I'm not
familiar with the Linux DTrace type tools).

Sorting is usually i/o bound on mapred.local.dir. When eight tasks are
using the same device this could become a bottleneck. Use iostat or sar
to view disk i/o statistics.
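
If iostat or sar is not at hand, a rough per-device busy figure can be derived from two snapshots of /proc/diskstats on a 2.6 kernel. The following is a minimal sketch only, not part of Nutch; the function names are made up for illustration, and it assumes the full 14-column layout, in which the tenth counter after the device name is the time spent doing I/O in milliseconds.

```python
import time


def parse_diskstats(text):
    """Map device name -> milliseconds spent doing I/O so far."""
    busy = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 14:
            continue  # skip short-format lines (assumption: full layout only)
        # parts[0:3] are major, minor, name; parts[12] is the tenth counter.
        busy[parts[2]] = int(parts[12])
    return busy


def utilization(before, after, interval_ms):
    """Fraction of the interval each device was busy, from two snapshots."""
    first = parse_diskstats(before)
    second = parse_diskstats(after)
    return {
        dev: (second[dev] - first[dev]) / interval_ms
        for dev in second
        if dev in first
    }


def sample(interval_s=5.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and return per-device utilization."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval_s)
    with open(path) as f:
        after = f.read()
    return utilization(before, after, interval_s * 1000.0)
```

Calling sample() while the sort phase runs and seeing the device behind mapred.local.dir close to 1.0 would be consistent with the disk being the bottleneck; a low figure would point elsewhere.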

My plan is to permit one to specify a list of directories for
mapred.local.dir and have the sorting (and everything else) select
randomly among these for temporary local files. That way all devices
can be used in parallel.
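
The selection described above could look roughly like the sketch below. This is an illustration of the idea, not the Nutch code: the comma-separated format for the property value and both function names are assumptions.

```python
import random


def parse_local_dirs(value):
    """Split a mapred.local.dir value into directories (assumed comma-separated)."""
    return [d.strip() for d in value.split(",") if d.strip()]


def pick_local_dir(value, rng=random):
    """Pick one configured directory uniformly at random for a temporary file."""
    dirs = parse_local_dirs(value)
    if not dirs:
        raise ValueError("mapred.local.dir lists no directories")
    return rng.choice(dirs)
```

With one directory per physical device, each new temporary file lands on a randomly chosen disk, so concurrent tasks tend to spread their sort i/o across all devices rather than queueing on one.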

As a workaround you could try starting eight tasktrackers, each
configured with a different device for mapred.local.dir. Yes, that's a
pain, but it would give us an idea of whether my analysis is correct.
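
To take some of the pain out of the workaround, the per-tasktracker property overrides can be generated rather than written by hand. A small sketch, assuming the usual property/name/value layout of the site configuration file; the mount points and the mapred/local subdirectory are hypothetical, and how each tasktracker is pointed at its own configuration is left out.

```python
from xml.sax.saxutils import escape


def local_dir_property(path):
    """Render a mapred.local.dir override as a <property> fragment."""
    return (
        "<property>\n"
        "  <name>mapred.local.dir</name>\n"
        f"  <value>{escape(path)}</value>\n"
        "</property>\n"
    )


def per_tracker_properties(mount_points):
    """One fragment per tasktracker, each on its own device (hypothetical paths)."""
    return {
        index: local_dir_property(f"{mount}/mapred/local")
        for index, mount in enumerate(mount_points)
    }
```

For example, per_tracker_properties(["/mnt/disk0", "/mnt/disk1"]) yields two fragments, one for each tasktracker's site configuration.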


Doug
