On Tue, Apr 27, 2010 at 12:50 PM, elsif <elsif.t...@gmail.com> wrote:
> The network checks out fine. Each node has 4G of memory and there are
> no other java processes running besides the data node. The nodes all
> have a very low load and very little swap in use:
>
> top - 12:40:45 up 272 days, 1:41, 2 users, load average: 0.10, 0.13, 0.22
> Tasks: 63 total, 1 running, 62 sleeping, 0 stopped, 0 zombie
> Cpu(s): 0.0%us, 6.0%sy, 0.0%ni, 92.7%id, 0.7%wa, 0.3%hi, 0.3%si, 0.0%st
> Mem: 3635144k total, 3109948k used, 525196k free, 2872k buffers
> Swap: 15624944k total, 87440k used, 15537504k free, 2127708k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 20053 root 20 0 1291m 686m 6960 S 3.7 19.3 11:31.74 java
>
> One additional correlation point we are seeing is the block reports are
> taking a long time:
>
> 2010-04-27 08:32:47,470 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 733396 blocks got processed in 308347 msecs
>

You have 733 thousand blocks on each DN? This is most likely your issue.
There are some patches in trunk to improve performance for heavy-storage
DNs, but that number of blocks is still very high. My guess is that you are
storing lots of tiny files, which is not one of HDFS's strong suits.

If you periodically jstack the DataNode JVM, you may be able to see what is
causing contention. My guess is that it is related to your high number of
blocks.

-Todd

> 2010-04-27 09:42:47,701 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 734668 blocks got processed in 217988 msecs
> 2010-04-27 10:24:02,943 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 736449 blocks got processed in 271354 msecs
> 2010-04-27 11:23:08,826 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738089 blocks got processed in 305396 msecs
> 2010-04-27 12:23:10,333 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 738367 blocks got processed in 306389 msecs
>
> On 04/27/2010 11:33 AM, Todd Lipcon wrote:
> > Those errors would indicate problems on the DN or client level, not
> > the NN level.
> >
> > I'd double check your networking, make sure you don't have any
> > switching issues, etc. Also double check for swapping on your DNs (if
> > you see more than a few MB swapped out, you need to oversubscribe your
> > memory less).
> >
> > -Todd
> >

--
Todd Lipcon
Software Engineer, Cloudera
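
To put the quoted block report timings in perspective, here is a rough sketch (not part of the original thread) that pulls the block count and processing time out of DataNode log lines and averages them. The regular expression assumes the exact "BlockReport of N blocks got processed in M msecs" wording shown in the logs above; the function names and the SAMPLE list are made up for illustration.

```python
import re

# Matches the DataNode log wording quoted in this thread (an assumption:
# other Hadoop versions may phrase this line differently).
REPORT_RE = re.compile(r"BlockReport of (\d+) blocks got processed in (\d+) msecs")


def parse_block_reports(lines):
    """Return a list of (blocks, msecs) tuples, one per matching log line."""
    reports = []
    for line in lines:
        match = REPORT_RE.search(line)
        if match:
            reports.append((int(match.group(1)), int(match.group(2))))
    return reports


def summarize(reports):
    """Return (average seconds per report, blocks processed per second)."""
    total_blocks = sum(blocks for blocks, _ in reports)
    total_ms = sum(msecs for _, msecs in reports)
    avg_seconds = total_ms / len(reports) / 1000.0
    blocks_per_second = total_blocks / (total_ms / 1000.0)
    return avg_seconds, blocks_per_second


# The five log lines quoted in this thread, trimmed to the relevant part.
SAMPLE = [
    "2010-04-27 08:32:47,470 INFO DataNode: BlockReport of 733396 blocks got processed in 308347 msecs",
    "2010-04-27 09:42:47,701 INFO DataNode: BlockReport of 734668 blocks got processed in 217988 msecs",
    "2010-04-27 10:24:02,943 INFO DataNode: BlockReport of 736449 blocks got processed in 271354 msecs",
    "2010-04-27 11:23:08,826 INFO DataNode: BlockReport of 738089 blocks got processed in 305396 msecs",
    "2010-04-27 12:23:10,333 INFO DataNode: BlockReport of 738367 blocks got processed in 306389 msecs",
]

if __name__ == "__main__":
    avg_s, rate = summarize(parse_block_reports(SAMPLE))
    print(f"average report time: {avg_s:.1f} s, throughput: {rate:.0f} blocks/s")
```

To run this against a real DataNode log, pass the open file object to parse_block_reports instead of SAMPLE. Tracking these numbers over time should show whether report time grows with the block count.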