[ https://issues.apache.org/jira/browse/HDFS-6022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Masatake Iwasaki updated HDFS-6022:
-----------------------------------
Labels: patch (was: BB2015-05-TBR patch)
> Moving deadNodes from being thread local. Improving dead datanode handling in
> DFSClient
> ----------------------------------------------------------------------------------------
>
> Key: HDFS-6022
> URL: https://issues.apache.org/jira/browse/HDFS-6022
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: hdfs-client
> Affects Versions: 3.0.0, 0.23.9, 0.23.10, 2.2.0, 2.3.0
> Reporter: Jack Levin
> Assignee: Masatake Iwasaki
> Labels: patch
> Attachments: HADOOP-6022.patch
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> This patch solves the issue of the deadNodes list being thread local. The
> deadNodes list is built by DFSClient when reading from, writing to, or
> contacting a datanode fails. The problem is that deadNodes is not visible to
> other DFSInputStream threads, hence every DFSInputStream ends up building its
> own deadNodes. This affects the performance of DFSClient to a large degree,
> especially when a datanode goes completely offline: every DFSInputStream
> thread experiences the TCP connect delay, which affects the performance of
> the whole cluster.
> This patch makes deadNodes global in the DFSClient class so that as soon as
> a single DFSInputStream thread reports a dead datanode, all other
> DFSInputStream threads are informed, removing the need for each to build its
> own independent list (really a concurrent Map).
> Further, a global deadNodes health check manager thread (DeadNodeVerifier) is
> created to verify all dead datanodes every 5 seconds and to remove a datanode
> from the list as soon as it is up. Under normal conditions (deadNodes
> empty) that thread sleeps. If deadNodes is not empty, the thread attempts
> to open a TCP connection to each affected datanode every 5 seconds.
> This patch has a test (TestDFSClientDeadNodes) that is quite simple: since
> deadNodes creation is not affected by the patch, we only test datanode
> removal from deadNodes by the health check manager thread. The test creates
> a file in a DFS minicluster, reads from that file rapidly, causes a datanode
> to restart, and tests whether the health check manager thread does the right
> thing by removing the live datanode from the global deadNodes list.
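
The design described above (one client-wide dead-node collection plus a
periodic verifier) could be sketched roughly as follows. This is not the code
from the attached patch: the class SharedDeadNodes, its methods, and the
pluggable probe are hypothetical names chosen for illustration, and a
concurrent Set stands in for the concurrent Map the description mentions. A
scheduled executor is used in place of a hand-written sleeping thread, which
is an assumption about a reasonable implementation, not what DeadNodeVerifier
necessarily does.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical sketch of a client-wide dead-node list with a verifier. */
final class SharedDeadNodes implements AutoCloseable {
    // Single collection shared by all reader threads of one client.
    private final Set<InetSocketAddress> deadNodes = ConcurrentHashMap.newKeySet();
    // Returns true when a node is reachable again (injected so it can be faked).
    private final Predicate<InetSocketAddress> probe;
    private final ScheduledExecutorService verifier;

    SharedDeadNodes(Predicate<InetSocketAddress> probe, long periodMillis) {
        this.probe = probe;
        this.verifier = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "dead-node-verifier");
            t.setDaemon(true); // do not keep the JVM alive for health checks
            return t;
        });
        this.verifier.scheduleWithFixedDelay(
            this::verifyOnce, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    /** Called by any reader thread that failed to reach a datanode. */
    void reportDead(InetSocketAddress node) {
        deadNodes.add(node);
    }

    /** Lets other reader threads skip a node without paying the connect delay. */
    boolean isDead(InetSocketAddress node) {
        return deadNodes.contains(node);
    }

    /** One verification pass: drop every node whose probe succeeds. */
    void verifyOnce() {
        try {
            deadNodes.removeIf(probe);
        } catch (RuntimeException e) {
            // Swallow so a failing probe does not cancel the periodic task.
        }
    }

    /** Example probe: a plain TCP connect with a 1 second timeout (assumed value). */
    static boolean tcpProbe(InetSocketAddress node) {
        try (Socket s = new Socket()) {
            s.connect(node, 1000);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    @Override
    public void close() {
        verifier.shutdownNow();
    }
}
```

With the 5 second interval from the description, a client would construct this
as new SharedDeadNodes(SharedDeadNodes::tcpProbe, 5000). When the set is empty
a pass does no network work, which corresponds to the "sleeping" case above.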
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)