The network checks out fine.  Each node has 4G of memory, and the data
node is the only Java process running.  The nodes all have a very low
load and very little swap in use (about 85 MB of roughly 15 GB):

top - 12:40:45 up 272 days,  1:41,  2 users,  load average: 0.10, 0.13, 0.22
Tasks:  63 total,   1 running,  62 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  6.0%sy,  0.0%ni, 92.7%id,  0.7%wa,  0.3%hi,  0.3%si,  0.0%st
Mem:   3635144k total,  3109948k used,   525196k free,     2872k buffers
Swap: 15624944k total,    87440k used, 15537504k free,  2127708k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
20053 root      20   0 1291m 686m 6960 S  3.7 19.3  11:31.74 java


One more data point that may be correlated: the block reports are
taking a long time, roughly 220 to 310 seconds each:

2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396 blocks got processed in 308347 msecs
2010-04-27 09:42:47,701 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 734668 blocks got processed in 217988 msecs
2010-04-27 10:24:02,943 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 736449 blocks got processed in 271354 msecs
2010-04-27 11:23:08,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738089 blocks got processed in 305396 msecs
2010-04-27 12:23:10,333 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738367 blocks got processed in 306389 msecs
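
In case it helps anyone compare nodes, here is a rough sketch of how the
block report lines above could be turned into a per-block cost.  The
function name summarize and the regular expression are my own and only
assume the message format shown in these logs; nothing here is part of
Hadoop itself.

```python
import re

# Assumed message format, taken from the DataNode log lines quoted above.
PATTERN = re.compile(
    r"BlockReport of (\d+)\s+blocks got processed in (\d+) msecs"
)


def summarize(log_text):
    """Return a list of (blocks, msecs, msecs_per_block) tuples."""
    # Collapse all whitespace first, in case a mail client wrapped the lines.
    flat = " ".join(log_text.split())
    rows = []
    for blocks, msecs in PATTERN.findall(flat):
        blocks, msecs = int(blocks), int(msecs)
        rows.append((blocks, msecs, msecs / blocks))
    return rows


# Two of the entries from this message, wrapped as they arrived by mail.
sample = """
2010-04-27 08:32:47,470 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396
blocks got processed in 308347 msecs
2010-04-27 09:42:47,701 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 734668
blocks got processed in 217988 msecs
"""

for blocks, msecs, per_block in summarize(sample):
    print(f"{blocks} blocks in {msecs / 1000:.1f} s ({per_block:.3f} ms/block)")
```

Running the same thing over a full datanode log should show whether the
time per block is steady or growing as the block count climbs.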

On 04/27/2010 11:33 AM, Todd Lipcon wrote:
> Those errors would indicate problems on the DN or client level, not
> the NN level.
>
> I'd double check your networking, make sure you don't have any
> switching issues, etc. Also double check for swapping on your DNs (if
> you see more than a few MB swapped out, you need to oversubscribe your
> memory less).
>
> -Todd
>
>
