FYI: Problem fixed.  It was apparently a timeout condition in Hadoop 0.18.3 
that only surfaced once the additional nodes were added.  The solution was to 
put the following entry in hadoop-site.xml:

<property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
</property>
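
For anyone unsure where the entry goes, here is a minimal sketch of a complete 
hadoop-site.xml containing it.  It assumes the usual <configuration> root 
element; a real file will normally carry other site-specific properties 
alongside this one.  As far as I understand, a value of 0 disables the datanode 
socket write timeout rather than setting it to zero milliseconds, but treat 
that as an assumption and check hadoop-default.xml for your version.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- existing site-specific properties, if any, stay in this element -->
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <!-- 0 is the value from the fix above; assumed to mean "no write timeout" -->
    <value>0</value>
  </property>
</configuration>
```

The datanodes read this file at startup, so the change most likely has to be 
copied to every node and the daemons restarted before it takes effect.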

Thanks to 'jdcryans' and 'digarok' from IRC for the help.

-kevin

-----Original Message-----
From: Kevin Eppinger [mailto:[email protected]] 
Sent: Tuesday, April 07, 2009 1:05 PM
To: [email protected]
Subject: Hadoop data nodes failing to start

Hello everyone-

So I have a 5-node cluster that I've been running for a few weeks with no 
problems.  Today I decided to add nodes and double its size to 10.  After doing 
all the setup and starting the cluster, I discovered that four of the 10 nodes 
had failed to start up.  Specifically, the data nodes didn't start.  The task 
trackers seemed to start fine.  Thinking it was something I did incorrectly 
with the expansion, I then reverted to the 5-node configuration, but I'm 
experiencing the same problem, with only 2 of 5 nodes starting correctly.  
Here is what I'm seeing in the hadoop-*-datanode*.log files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.131.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down.  This same message is appearing on all the 
failed nodes.  Help!

-kevin
