Kevin, I'm glad it worked for you.
We talked a bit about 5114 yesterday, any chance of trying 0.18 branch on that same cluster without the socket timeout thing?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger <[email protected]> wrote:

> FYI: Problem fixed. It was apparently a timeout condition present in 0.18.3
> that only popped up when the additional nodes were added. The solution was
> to put the following entry in hadoop-site.xml:
>
> <property>
>   <name>dfs.datanode.socket.write.timeout</name>
>   <value>0</value>
> </property>
>
> Thanks to 'jdcryans' and 'digarok' from IRC for the help.
>
> -kevin
>
> -----Original Message-----
> From: Kevin Eppinger [mailto:[email protected]]
> Sent: Tuesday, April 07, 2009 1:05 PM
> To: [email protected]
> Subject: Hadoop data nodes failing to start
>
> Hello everyone-
>
> So I have a 5 node cluster that I've been running for a few weeks with no
> problems. Today I decided to add nodes and double its size to 10. After
> doing all the setup and starting the cluster, I discovered that four out of
> the 10 nodes had failed to startup. Specifically, the data nodes didn't
> start. The task trackers seemed to start fine. Thinking it was something I
> did incorrectly with the expansion, I then reverted back to the 5 node
> configuration but I'm experiencing the same problem...with only 2 of 5 nodes
> starting correctly. Here is what I'm seeing in the hadoop-*-datanode*.log
> files:
>
> 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
> 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
> 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
>         at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
>         at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
>         at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
>         at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
>         at java.lang.Thread.run(Thread.java:619)
>
> After this the data node shuts down. This same message is appearing on all
> the failed nodes. Help!
>
> -kevin
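
For anyone wanting to double-check that the entry quoted above actually made it into hadoop-site.xml on every node, here is a rough Python sketch. It is only an illustration: the helper name get_hadoop_property and the inline SAMPLE text are made up for this example, and it assumes the file uses the usual <configuration> root with <property>/<name>/<value> children. On a real node you would read the file contents instead of SAMPLE.

```python
import xml.etree.ElementTree as ET


def get_hadoop_property(xml_text, name):
    """Return the <value> text of the named <property>, or None if it is absent.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Stand-in for the contents of hadoop-site.xml (assumed layout).
SAMPLE = """<configuration>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>0</value>
  </property>
</configuration>"""

timeout = get_hadoop_property(SAMPLE, "dfs.datanode.socket.write.timeout")
if timeout is None:
    print("dfs.datanode.socket.write.timeout is not set in this file")
else:
    print("dfs.datanode.socket.write.timeout =", timeout)
```

Running the same check against each node's copy of the file is a quick way to spot a node that was missed when the setting was rolled out.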
