The network checks out fine. Each node has 4 GB of memory, and no Java processes are running other than the data node. The nodes all have a very low load and very little swap in use:

top - 12:40:45 up 272 days,  1:41,  2 users,  load average: 0.10, 0.13, 0.22
Tasks:  63 total,   1 running,  62 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  6.0%sy,  0.0%ni, 92.7%id,  0.7%wa,  0.3%hi,  0.3%si,  0.0%st
Mem:   3635144k total,  3109948k used,   525196k free,     2872k buffers
Swap: 15624944k total,    87440k used, 15537504k free,  2127708k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20053 root      20   0 1291m 686m 6960 S  3.7 19.3  11:31.74 java

One additional correlation point we are seeing is the block reports are taking a long time:

2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396 blocks got processed in 308347 msecs
2010-04-27 09:42:47,701 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 734668 blocks got processed in 217988 msecs
2010-04-27 10:24:02,943 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 736449 blocks got processed in 271354 msecs
2010-04-27 11:23:08,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738089 blocks got processed in 305396 msecs
2010-04-27 12:23:10,333 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738367 blocks got processed in 306389 msecs

On 04/27/2010 11:33 AM, Todd Lipcon wrote:
> Those errors would indicate problems on the DN or client level, not
> the NN level.
>
> I'd double check your networking, make sure you don't have any
> switching issues, etc. Also double check for swapping on your DNs (if
> you see more than a few MB swapped out, you need to oversubscribe your
> memory less).
>
> -Todd
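
For anyone who wants to track the BlockReport timings quoted above across several DataNodes, here is a rough sketch of how the log lines could be summarized. It assumes the exact "BlockReport of N blocks got processed in M msecs" wording shown in this message; the names parse_block_reports, summarize and SAMPLE_LINES are made up for illustration and are not part of Hadoop.

```python
import re

# Assumption: the log line format matches the DataNode lines quoted in this
# message. Other Hadoop versions may word this message differently.
PATTERN = re.compile(
    r"(\S+ \S+) INFO \S+DataNode: BlockReport of (\d+) blocks "
    r"got processed in (\d+) msecs"
)

# Two lines copied from the log excerpt in this message.
SAMPLE_LINES = [
    "2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode."
    "DataNode: BlockReport of 733396 blocks got processed in 308347 msecs",
    "2010-04-27 09:42:47,701 INFO org.apache.hadoop.hdfs.server.datanode."
    "DataNode: BlockReport of 734668 blocks got processed in 217988 msecs",
]


def parse_block_reports(lines):
    """Return (timestamp, blocks, msecs) for each matching log line."""
    reports = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            reports.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return reports


def summarize(reports):
    """Return (timestamp, blocks, seconds, msecs_per_block) per report."""
    rows = []
    for timestamp, blocks, msecs in reports:
        per_block = msecs / blocks if blocks else 0.0
        rows.append((timestamp, blocks, msecs / 1000.0, per_block))
    return rows


if __name__ == "__main__":
    for ts, blocks, secs, per_block in summarize(parse_block_reports(SAMPLE_LINES)):
        print(f"{ts}  blocks={blocks}  {secs:.1f}s  {per_block:.3f} ms/block")
```

To run it against a real log, pass an open file object in place of SAMPLE_LINES, since parse_block_reports only needs an iterable of lines.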