Jack Levin created HDFS-6022:
--------------------------------

             Summary: Moving deadNodes from being thread local. Improving dead 
datanode handling in DFSClient 
                 Key: HDFS-6022
                 URL: https://issues.apache.org/jira/browse/HDFS-6022
             Project: Hadoop HDFS
          Issue Type: Improvement
          Components: hdfs-client
    Affects Versions: trunk-win
            Reporter: Jack Levin
             Fix For: trunk-win


This patch solves the issue of the deadNodes list being thread local.  The 
deadNodes list is created by DFSClient when there are problems writing to, 
reading from, or contacting a datanode.  The problem is that deadNodes is not 
visible to other DFSInputStream threads, hence every DFSInputStream ends up 
building its own deadNodes.  This affects the performance of DFSClient to a 
large degree, especially when a datanode goes completely offline: every 
DFSInputStream thread experiences the TCP connect delay, which affects the 
performance of the whole cluster.

This patch makes deadNodes global in the DFSClient class so that as soon as a 
single DFSInputStream thread reports a dead datanode, all other DFSInputStream 
threads are informed and no longer need to build their own independent lists 
(concurrent Maps, really). 
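
To illustrate the idea of a single dead-node structure shared by all readers, 
here is a minimal sketch.  It is not the code from the patch: the class name 
SharedDeadNodes, its methods, and the use of "host:port" strings as keys are 
assumptions made for illustration, and a concurrent set backed by 
ConcurrentHashMap stands in for the concurrent Map mentioned above.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical client-wide dead-node registry; not the patch's actual code. */
class SharedDeadNodes {
    // One concurrent set shared by every reader thread, keyed by "host:port"
    // (the key format is an assumption for this sketch).
    private final Set<String> deadNodes = ConcurrentHashMap.newKeySet();

    /** Called by any reader that fails to reach a datanode. True if newly added. */
    boolean markDead(String datanodeAddr) {
        return deadNodes.add(datanodeAddr);
    }

    /** Lets other readers skip a node that someone else already found dead. */
    boolean isDead(String datanodeAddr) {
        return deadNodes.contains(datanodeAddr);
    }

    /** Called once the node is reachable again. True if it was listed. */
    boolean markAlive(String datanodeAddr) {
        return deadNodes.remove(datanodeAddr);
    }

    int size() {
        return deadNodes.size();
    }
}
```

With one instance held by the client, a reader can check isDead() before 
attempting a connection, so only the first reader to hit an offline datanode 
pays the connect delay.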

Further, a global deadNodes health check manager thread (DeadNodeVerifier) is 
created to verify all dead datanodes every 5 seconds and to remove a datanode 
from the list as soon as it is back up.  Under normal conditions (deadNodes 
empty) that thread is sleeping.  If deadNodes is not empty, the thread attempts 
to open a TCP connection to each affected datanode every 5 seconds.
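
The verification loop described above might look roughly like the following 
sketch.  It is an illustration only, not the DeadNodeVerifier from the patch: 
the class name DeadNodeVerifierSketch, the constructor parameters, and the 
connect timeout are assumptions, and the set passed in is assumed to be a 
concurrent set so that entries can be removed while iterating.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;

/** Illustrative sketch only; names, parameters and timeouts are assumptions. */
class DeadNodeVerifierSketch implements Runnable {
    private final Set<InetSocketAddress> deadNodes; // must be a concurrent set
    private final long intervalMs;                  // e.g. 5000 for "every 5 seconds"
    private final int connectTimeoutMs;

    DeadNodeVerifierSketch(Set<InetSocketAddress> deadNodes,
                           long intervalMs, int connectTimeoutMs) {
        this.deadNodes = deadNodes;
        this.intervalMs = intervalMs;
        this.connectTimeoutMs = connectTimeoutMs;
    }

    /** One pass: probe each listed node, drop those that accept a TCP connection. */
    void verifyOnce() {
        for (InetSocketAddress addr : deadNodes) {
            if (isReachable(addr)) {
                deadNodes.remove(addr);
            }
        }
    }

    private boolean isReachable(InetSocketAddress addr) {
        try (Socket socket = new Socket()) {
            socket.connect(addr, connectTimeoutMs);
            return true;
        } catch (IOException e) {
            return false; // still unreachable, keep it in the list
        }
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                verifyOnce();             // does nothing when the set is empty
                Thread.sleep(intervalMs); // sleep until the next round
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit quietly on shutdown
        }
    }
}
```

Run as a daemon thread with an interval of 5000 ms, this matches the behaviour 
described: the thread mostly sleeps, and only opens connections while the list 
is non-empty.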

This patch has a test (TestDFSClientDeadNodes) that is quite simple: since 
deadNodes creation is not affected by the patch, we only test datanode removal 
from deadNodes by the health check manager thread.  The test creates a file in 
a DFS minicluster, reads from that file rapidly, causes a datanode to restart, 
and checks whether the health check manager thread does the right thing by 
removing the live datanode from the global deadNodes list.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
