Exceptions in DataXceiver#run can result in a zombie datanode 
--------------------------------------------------------------

                 Key: HDFS-2182
                 URL: https://issues.apache.org/jira/browse/HDFS-2182
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: data-node
            Reporter: Eli Collins
             Fix For: 0.23.0


DataXceiver#run currently swallows all exceptions. It should instead plumb them
up to DataXceiverServer#run, which can then decide whether the exception should
be tolerated or the daemon should exit. An IOException should be tolerated, as
it is today, because it most likely reflects a problem with a particular thread
or an intermittent failure, but e.g. a java.lang.Error should not be.
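
A rough sketch of the proposed policy, to make the intent concrete. None of the
names below (XceiverFailurePolicy, Server, Task, classify, onWorkerFailure) are
from the HDFS code base; they are hypothetical stand-ins. How throwables that
are neither IOException nor Error should be handled is not decided here, so
the sketch simply assumes they are tolerated.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch only; these are not the real DataXceiver classes.
public class XceiverFailurePolicy {

    /** What the server should do after a worker reports a failure. */
    enum Action { TOLERATE, SHUT_DOWN }

    /** Body of a worker; may throw anything. */
    interface Task { void run() throws Exception; }

    /** Classify a failure: tolerate IOException, exit on java.lang.Error. */
    static Action classify(Throwable t) {
        if (t instanceof IOException) {
            return Action.TOLERATE;   // likely one bad thread or a transient failure
        }
        if (t instanceof Error) {
            return Action.SHUT_DOWN;  // e.g. NoClassDefFoundError
        }
        return Action.TOLERATE;       // assumption: policy for other types is undecided
    }

    /** Stand-in for the server that owns the "keep running" flag. */
    static class Server {
        final AtomicBoolean running = new AtomicBoolean(true);

        /** Called with every failure instead of the worker swallowing it. */
        void onWorkerFailure(Throwable t) {
            if (classify(t) == Action.SHUT_DOWN) {
                running.set(false);
            }
        }

        /** Run a worker body and report, rather than swallow, what it throws. */
        void runWorker(Task body) {
            try {
                body.run();
            } catch (Throwable t) {
                onWorkerFailure(t);
            }
        }
    }
}
```

With this shape the decision lives in one place on the server side, and the
worker only reports what happened.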

This came up in a bug I'm seeing on a test cluster: if e.g. a
NoClassDefFoundError is thrown in DataXceiver#run (because the host jars were
replaced out from underneath it, it ran out of file descriptors, etc.), we end
up with a datanode that is alive but always fails because it can't create any
DataXceiver threads. In this case the datanode should shut itself down rather
than continue to run.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
