On Mon, 2005-10-03 at 14:11 -0700, Doug Cutting wrote:
> Rod Taylor wrote:
> > I see. Is there any way to speed up this phase? It seems to be taking as
> > long to run the sort phase as it did to download the data.
> >
> > It would appear that nearly 30% of the time for the nutch fetch segment
> > is spent doing the sorts, so I'm well off the 20% overhead number you
> > seem to be able to achieve for a full cycle.
> >
> > 5 machines (4 CPUs each), each running 8 tasks, with a load average of
> > about 5; they run Redhat. Context switches are low (under 1500/second). There is
> > virtually no IO (boxes have plenty of ram) but the kernel is doing a
> > bunch of work as 50% of CPU time is in system (unsure what, I'm not
> > familiar with the Linux DTrace type tools).
>
> Sorting is usually i/o bound on mapred.local.dir. When eight tasks are
> using the same device this could become a bottleneck. Use iostat or sar
> to view disk i/o statistics.

Virtually no IO is reported at all. Reads average about 200kB/sec and
writes are usually 0, but burst to 120MB/sec for under 1 second once
every 30 seconds or so.

The IO load is not all that high (lots of ram) and the wait time isn't
too bad for a single drive. I presume Nutch caches enough in memory
that it can read/write large blocks of data, which keeps the harddrive
doing mostly sequential IO.

avg-cpu:  %user   %nice    %sys %iowait   %idle
          60.15    0.00   39.07    0.10    0.68

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00  246.35  1.30  12.09  307.69  2067.53  153.85  1033.77   177.43     0.58  43.54   5.21   6.97

avg-cpu:  %user   %nice    %sys %iowait   %idle
          61.86    0.00   37.29    0.12    0.72

Device:  rrqm/s  wrqm/s   r/s    w/s  rsec/s   wsec/s   rkB/s    wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.10  271.67  1.60  12.01  410.81  2269.47  205.41  1134.73   196.88     0.59  43.45   4.82   6.57
That is well under what the disk can burst:
[EMAIL PROTECTED] ~]$ time dd if=/dev/zero of=/home/rbt/here bs=4K count=25600
25600+0 records in
25600+0 records out
real 0m0.488s
user 0m0.003s
sys 0m0.474s
[EMAIL PROTECTED] ~]$ ls -lah here
-rw-rw-r-- 1 rbt rbt 100M Oct 3 17:35 here
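
One caveat on the dd run above: 100MB in 0.488s works out to roughly 205MB/sec with almost all of the time in sys, so the write may well have landed in the page cache rather than on the disk. A minimal sketch for comparing a buffered write with one that is flushed before dd exits, assuming a GNU dd that supports conv=fsync. The temporary file and the 16MB size are illustrative choices, not values from the run above:

```shell
#!/bin/sh
# Compare a buffered write with one that is flushed to disk before dd exits.
# Assumes GNU dd with conv=fsync; the path and size are illustrative only.
out=$(mktemp /tmp/ddtest.XXXXXX)

# Buffered: dd returns as soon as the data is in the page cache.
dd if=/dev/zero of="$out" bs=4K count=4096

# Flushed: conv=fsync makes dd fsync the output file before it exits,
# so the elapsed time includes getting the data onto the device.
dd if=/dev/zero of="$out" bs=4K count=4096 conv=fsync

# Sanity check: 4096 blocks of 4096 bytes each.
size=$(wc -c < "$out")
echo "wrote $size bytes"
rm -f "$out"
```

GNU dd versions that report throughput print it on stderr after each run; with an older dd, prefixing each command with `time` as above gives the same comparison. If the flushed run is much slower than the buffered one, the first figure is a memory number rather than a disk number.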

Does Nutch fsync the data, or is it pessimistic in other ways?

--
Rod Taylor <[EMAIL PROTECTED]>