On Mon, 2005-10-03 at 13:12 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > I have high load, but it seems that the percentage progress
> > during the reduce > sort phase of fetch (parse?) is not increasing, which
> > makes it appear as if nothing is happening (stuck at 0.5, or 50%).
>
> That's correct. There are currently no progress reports during sorting.
> Reduce progress sticks at 50% during sorting, and jumps to 75% on
> completion of the sort phase.

I see. Is there any way to speed up this phase? The sort phase seems to
take as long to run as downloading the data did.

It would appear that nearly 30% of the time for the nutch fetch segment
is spent doing the sorts, so I'm well off the 20% overhead number you
seem to be able to achieve for a full cycle.

The setup is 5 machines (4 CPUs each) running Red Hat, each with 8 tasks;
the load average is about 5. Context switches are low (under
1500/second). There is virtually no IO (the boxes have plenty of RAM),
but the kernel is doing a lot of work: 50% of CPU time is in system. I'm
unsure what it is doing, as I'm not familiar with the Linux DTrace-type
tools.

I generated the segment for the top 10 million pages, with 10 pages per
host.

map.tasks=383, reduce.tasks=43

--
Rod Taylor <[EMAIL PROTECTED]>
