Those errors would indicate problems on the DN or client level, not the NN level.
I'd double check your networking, make sure you don't have any switching issues, etc. Also double check for swapping on your DNs (if you see more than a few MB swapped out, you need to oversubscribe your memory less).
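For the swap check, something like the following run on each DN will tell you how much is currently swapped out (a minimal sketch, assuming a Linux /proc/meminfo with SwapTotal/SwapFree fields; the "few MB" cutoff below is purely illustrative, not a hard rule):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Prints how many MB are currently swapped out, per /proc/meminfo.
public class SwapCheck {
  public static void main(String[] args) throws IOException {
    long swapTotalKb = 0, swapFreeKb = 0;
    BufferedReader in = new BufferedReader(new FileReader("/proc/meminfo"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        // Lines look like: "SwapTotal:      2097148 kB"
        String[] f = line.split("\\s+");
        if (line.startsWith("SwapTotal:")) swapTotalKb = Long.parseLong(f[1]);
        if (line.startsWith("SwapFree:"))  swapFreeKb  = Long.parseLong(f[1]);
      }
    } finally {
      in.close();
    }
    long swappedMb = (swapTotalKb - swapFreeKb) / 1024;
    System.out.println("swapped out: " + swappedMb + " MB");
    if (swappedMb > 4) {  // "more than a few MB" -- illustrative threshold only
      System.out.println("memory looks oversubscribed on this node");
    }
  }
}

Run it on a few DNs while the workload that triggers the timeouts is going, since swap usage at idle won't tell you much.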
-Todd

On Tue, Apr 27, 2010 at 10:22 AM, elsif <elsif.t...@gmail.com> wrote:
> Hello,
>
> We have a 35 node cluster running Hadoop 0.20.2 and are seeing periods
> of namenode inactivity that cause block read and write errors. We have
> enabled garbage collection logging and have determined that garbage
> collection is not the cause of the inactivity. We have been able to
> reproduce this behavior with as little as three clients doing reads and
> writes within Hadoop.
>
> During the namenode inactivity we see little if any namenode logging.
> Namenode stats: 9992150 files and directories, 6629279 blocks = 16621429 total.
> Heap Size is 7.29 GB / 11.7 GB (62%)
>
> During the inactive state, strace reports many calls to sched_yield:
>
> [pid 26249] sched_yield( <unfinished ...>
> [pid 26246] sched_yield( <unfinished ...>
> [pid 26249] <... sched_yield resumed> ) = 0
> [pid 26246] <... sched_yield resumed> ) = 0
> [pid 26249] sched_yield( <unfinished ...>
> [pid 26246] sched_yield( <unfinished ...>
> [pid 26249] <... sched_yield resumed> ) = 0
> [pid 26246] <... sched_yield resumed> ) = 0
> [pid 26249] sched_yield( <unfinished ...>
> [pid 26246] sched_yield( <unfinished ...>
> [pid 26249] <... sched_yield resumed> ) = 0
>
> An example of the errors seen:
>
> Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
> INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51257 remote=/10.4.30.14:50010]
> Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
> INFO: Abandoning block blk_1939160217996593958_48349447
> Apr 27, 2010 5:02:19 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
> INFO: Waiting to find target node: 10.4.30.14:50010
> Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
> INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:50554 remote=/10.4.30.30:50010]
> Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
> INFO: Abandoning block blk_1314187820700137040_48349451
> Apr 27, 2010 5:02:21 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
> INFO: Waiting to find target node: 10.4.30.30:50010
> Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream createBlockOutputStream
> INFO: Exception in createBlockOutputStream java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.4.30.11:51369 remote=/10.4.30.14:50010]
> Apr 27, 2010 5:02:29 PM org.apache.hadoop.hdfs.DFSClient$DFSOutputStream nextBlockOutputStream
> INFO: Abandoning block blk_8612699045898320923_48349455
>
> The periods of inactivity last 40-70 seconds followed by 20-30 seconds
> of activity and the process repeats. We have seen this behavior on both
> Hadoop 0.20.1 and 0.20.2 with the Sun jre build 1.6.0_13-b03.
>
> Any suggestions on this issue are greatly appreciated.
>
> --
>
> The datanodes occasionally report:
>
> 2010-04-27 09:54:21,247 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.4.30.24:50010, storageID=DS-1258702238-10.4.30.24-50010-1265310891468, infoPort=50075, ipcPort=50020):DataXceiver
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.io.DataInputStream.readByte(Unknown Source)
>         at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>         at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>         at org.apache.hadoop.io.Text.readString(Text.java:400)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>         at java.lang.Thread.run(Unknown Source)
>
> <configuration>
>
> <property>
> <name>dfs.name.dir</name>
> <value>/hadoop</value>
> </property>
>
> <property>
> <name>dfs.hosts</name>
> <value>/opt/hadoop/conf/dfs.hosts</value>
> </property>
>
> <property>
> <name>dfs.http.address</name>
> <value>hfs-030010:50070</value>
> </property>
>
> <property>
> <name>dfs.balance.bandwidthPerSec</name>
> <value>800000000</value>
> </property>
>
> <property>
> <name>dfs.datanode.max.xcievers</name>
> <value>4096</value>
> </property>
>
> <property>
> <name>dfs.datanode.socket.write.timeout</name>
> <value>0</value>
> </property>
>
> <property>
> <name>dfs.datanode.handler.count</name>
> <value>34</value>
> </property>
>
> <property>
> <name>dfs.namenode.handler.count</name>
> <value>64</value>
> </property>
>
> <property>
> <name>dfs.namenode.logging.level</name>
> <value>block</value>
> </property>
>
> </configuration>

--
Todd Lipcon
Software Engineer, Cloudera