Doug Cutting
Thu, 04 Aug 2005 12:54:23 -0700
Stefan Groschupf wrote:
http://wiki.apache.org/nutch/Presentations
Can you explain what this means: Page 20: - Scheduling is bottleneck, not disk, network or CPU?
I mean that none of the CPUs, disks, or network is at 100% of capacity. Disks are running around 50% busy, CPUs a bit higher, and the network switch has lots of bandwidth left. (Although, if we used multiple racks connected with gigabit links, these inter-rack links would already be near capacity.) So sometimes the CPU is busy generating random data and stuffing it in a buffer, and sometimes the disk is busy writing data, but we're not keeping both busy at the same time all the time. Perhaps more threads/processes and/or bigger buffers would increase the utilization--I have not tried to tune things for this benchmark. But I am not disappointed with this performance. Rather, I think that it is fast enough that, with real applications with non-trivial map and reduce functions, NDFS will not be a bottleneck.
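To illustrate the overlap idea only: the following is a minimal sketch, not the benchmark's actual code (NDFS itself is Java; Python is used here just for brevity). It assumes a hypothetical function, write_random_overlapped, in which one thread generates random data while another writes it to disk, with a bounded queue between them. The names, chunk size, and queue depth are all made up for illustration; a deeper queue corresponds to the "bigger buffers" mentioned above.

```python
import os
import queue
import tempfile
import threading


def write_random_overlapped(path, total_bytes, chunk_size=1 << 20, queue_depth=4):
    """Generate random data in one thread while the caller's thread writes it.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    chunks = queue.Queue(maxsize=queue_depth)  # bounded buffer between the stages

    def produce():
        remaining = total_bytes
        while remaining > 0:
            n = min(chunk_size, remaining)
            chunks.put(os.urandom(n))  # blocks while the buffer is full
            remaining -= n
        chunks.put(None)  # sentinel: no more data

    # Daemon thread so a failed write cannot leave the process hanging.
    producer = threading.Thread(target=produce, daemon=True)
    producer.start()

    written = 0
    with open(path, "wb") as out:
        while True:
            chunk = chunks.get()  # blocks while the buffer is empty
            if chunk is None:
                break
            out.write(chunk)
            written += len(chunk)
    producer.join()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bench.dat")
        print(write_random_overlapped(target, 8 << 20), "bytes written")
```

With a queue depth of 1 the two stages mostly take turns; raising it lets the generator run ahead while a write is in progress, which is the kind of tuning that was not attempted for this benchmark.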
Doug