Hello,

We have a 35-node cluster running Hadoop 0.20.2 and are seeing periods
of namenode inactivity that cause block read and write errors. We have
enabled garbage collection logging and have determined that garbage
collection is not the cause of the inactivity. We have been able to
reproduce this behavior with as few as three clients doing reads and
writes within Hadoop.
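For reference, this is the kind of check we ran against the GC log to rule
out collection pauses. The log path and the two sample lines below are
illustrative, not taken from the cluster; on a real namenode the input would
be the log produced by -verbose:gc -XX:+PrintGCDetails.

```shell
# Illustrative: sum the stop-the-world pause times reported in a GC log.
# The two sample lines stand in for real -XX:+PrintGCDetails output.
cat > /tmp/gc-sample.log <<'EOF'
12.345: [GC 2097152K->1310720K(4194304K), 0.0421250 secs]
98.765: [Full GC 3145728K->2097152K(4194304K), 1.8374210 secs]
EOF
# Each GC line ends in ", <pause> secs]"; strip the suffix and accumulate.
awk -F', ' '/GC/ { sub(/ secs\]$/, "", $NF); total += $NF }
            END  { printf "total GC pause: %.2f secs\n", total }' /tmp/gc-sample.log
```

With total pause time this low relative to the stall windows, GC can be
ruled out.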

During the namenode inactivity we see little, if any, namenode logging.
Namenode stats: 9992150 files and directories, 6629279 blocks = 16621429
total. Heap size is 7.29 GB / 11.7 GB (62%).

During the inactive state, strace on the namenode process reports many calls to sched_yield:

[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
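Given the sched_yield spinning, our next step is to take a few thread dumps
of the namenode during a stall (`jstack <namenode-pid>`, or kill -QUIT to
dump to stdout) and tally thread states. A sketch of the tally, using a
fabricated three-line sample dump (the thread names, port, and counts are
illustrative, not from our cluster):

```shell
# Sketch: summarize thread states in a saved jstack dump. The sample dump
# below is fabricated; real input would come from `jstack <namenode-pid>`.
cat > /tmp/nn-stack.txt <<'EOF'
"IPC Server handler 0 on 8020" daemon prio=10 nid=0x668a runnable
"IPC Server handler 1 on 8020" daemon prio=10 nid=0x668b waiting for monitor entry
"IPC Server handler 2 on 8020" daemon prio=10 nid=0x668c waiting for monitor entry
EOF
# Count how many threads are in each HotSpot-reported state.
grep -oE 'runnable|waiting for monitor entry|waiting on condition' /tmp/nn-stack.txt \
  | sort | uniq -c | sort -rn
```

If most IPC handler threads show "waiting for monitor entry" on the same
monitor during a stall, that would point to contention on the namesystem
lock rather than the scheduler itself.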

An example of the client-side errors seen:

Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51257 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1939160217996593958_48349447
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.14:50010
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:50554 remote=/10.4.30.30:50010]
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1314187820700137040_48349451
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.30:50010
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51369 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_8612699045898320923_48349455

The periods of inactivity last 40-70 seconds, followed by 20-30 seconds
of activity, and then the cycle repeats.  We have seen this behavior on
both Hadoop 0.20.1 and 0.20.2 with Sun JRE build 1.6.0_13-b03.
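We measure those windows from the gaps between consecutive namenode log
timestamps. A sketch of the measurement; the three timestamps below are
made up, and on the cluster the input would be the first two fields of the
namenode log (note this simple version ignores midnight rollover):

```shell
# Sketch: report gaps longer than 30 s between consecutive log timestamps.
# Sample timestamps are fabricated; real input would be something like
#   awk '{print $1, $2}' hadoop-*-namenode-*.log
cat > /tmp/nn-times.txt <<'EOF'
2010-04-27 17:02:19,123
2010-04-27 17:02:21,456
2010-04-27 17:03:30,789
EOF
# Convert HH:MM:SS to seconds-of-day and compare with the previous line.
awk '{ split($2, t, /[:,]/); s = t[1]*3600 + t[2]*60 + t[3];
       if (NR > 1 && s - prev > 30) printf "gap of %d secs before %s %s\n", s - prev, $1, $2;
       prev = s }' /tmp/nn-times.txt
```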

Any suggestions on this issue are greatly appreciated.

--

The datanodes occasionally report:

2010-04-27 09:54:21,247 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.30.24:50010, storageID=DS-1258702238-10.4.30.24-50010-1265310891468, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.DataInputStream.readByte(Unknown Source)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Unknown Source)


Our HDFS configuration:

<configuration>

  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop</value>
  </property>

  <property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop/conf/dfs.hosts</value>
  </property>

  <property>
    <name>dfs.http.address</name>
    <value>hfs-030010:50070</value>
  </property>

  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>800000000</value>
  </property>

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>

  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>

  <property>
    <name>dfs.datanode.handler.count</name>
    <value>34</value>
  </property>

  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>

  <property>
    <name>dfs.namenode.logging.level</name>
    <value>block</value>
  </property>

</configuration>
