Hello,

We have a 35-node cluster running Hadoop 0.20.2 and are seeing periods of namenode inactivity that cause block read and write errors. We have enabled garbage collection logging and have determined that garbage collection is not the cause of the inactivity. We have been able to reproduce this behavior with as few as three clients doing reads and writes within Hadoop.
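(For reference, GC logging was enabled with the standard Sun JVM options, roughly the following in hadoop-env.sh; the log path here is illustrative:)

    export HADOOP_NAMENODE_OPTS="-verbose:gc -XX:+PrintGCDetails \
        -XX:+PrintGCTimeStamps -Xloggc:/var/log/hadoop/namenode-gc.log"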
During the namenode inactivity we see little, if any, namenode logging.

Namenode stats: 9992150 files and directories, 6629279 blocks = 16621429 total. Heap Size is 7.29 GB / 11.7 GB (62%).

During the inactive state, strace reports many calls to sched_yield:

[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0
[pid 26246] <... sched_yield resumed> ) = 0
[pid 26249] sched_yield( <unfinished ...>
[pid 26246] sched_yield( <unfinished ...>
[pid 26249] <... sched_yield resumed> ) = 0

An example of the errors seen:

Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51257 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1939160217996593958_48349447
Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.14:50010
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:50554 remote=/10.4.30.30:50010]
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_1314187820700137040_48349451
Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Waiting to find target node: 10.4.30.30:50010
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
INFO: Exception in createBlockOutputStream
java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51369 remote=/10.4.30.14:50010]
Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
INFO: Abandoning block blk_8612699045898320923_48349455

The periods of inactivity last 40-70 seconds, followed by 20-30 seconds of activity, and the process repeats. We have seen this behavior on both Hadoop 0.20.1 and 0.20.2 with the Sun JRE, build 1.6.0_13-b03.
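If thread dumps taken during an inactive window would help, we can capture a few with jstack (from the JDK). A minimal sketch; the namenode pid shown is illustrative:

    NN_PID=26240   # illustrative pid of the namenode JVM
    for i in 1 2 3; do
        jstack $NN_PID > /tmp/namenode-threads.$i.txt   # dump all thread stacks
        sleep 15                                        # space the dumps across the stall
    done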
The datanodes occasionally report:

2010-04-27 09:54:21,247 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.30.24:50010, storageID=DS-1258702238-10.4.30.24-50010-1265310891468, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.DataInputStream.readByte(Unknown Source)
        at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
        at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
        at org.apache.hadoop.io.Text.readString(Text.java:400)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Unknown Source)

Relevant HDFS configuration:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/hadoop</value>
  </property>
  <property>
    <name>dfs.hosts</name>
    <value>/opt/hadoop/conf/dfs.hosts</value>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>hfs-030010:50070</value>
  </property>
  <property>
    <name>dfs.balance.bandwidthPerSec</name>
    <value>800000000</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
  <property>
    <name>dfs.datanode.handler.count</name>
    <value>34</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>64</value>
  </property>
  <property>
    <name>dfs.namenode.logging.level</name>
    <value>block</value>
  </property>
</configuration>

Any suggestions on this issue are greatly appreciated.