[jira] [Commented] (HDFS-13571) Dead DataNode Detector

Lisheng Sun (JIRA) Sat, 13 Jul 2019 22:03:42 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-13571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16884556#comment-16884556
 ]


Lisheng Sun commented on HDFS-13571:
------------------------------------

Thanks [~linyiqun] [~jojochuang] for the good suggestions.

The idea of combining HBase read replicas with HDFS hedged reads is not used 
in XiaoMi HBase. Because the same read then requires twice the bandwidth, we 
are worried about the impact on performance.

A summary of the Dead DataNode Detector design follows:

1. Design a node state machine. When an InputStream is opened, a BlockReader 
is opened, and the DataNode serving the block is added to the Live Node list. 
DeadNodeDetector periodically probes the nodes in this list; if a DataNode is 
found to be inaccessible, it is moved to the Dead Node list. At the same time, 
the InputStream itself also accesses the Live Nodes, and if an error occurs, 
the DataNode is placed in the Suspicious Node list.

2. A DataNode in the Suspicious Node list may really be a problem node, or the 
access may have failed only because the block is no longer on it. Therefore, 
its state needs to be confirmed by re-probing, and this re-probing is handled 
with higher priority: if there really is a problem with the DataNode, it needs 
to be moved to the Dead Node list quickly.

3. DeadNodeDetector also periodically probes the nodes in the Dead Node list. 
If the access succeeds, the node is moved back to the Live Node list. 
Continuous probing of dead nodes is necessary because a DataNode may rejoin 
the cluster after a service restart or machine repair. Without this probe 
mechanism, the DataNode could be permanently excluded.
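
As a rough illustration of the three steps above, the state machine could be 
sketched as follows. This is not the code in the patch: the class name 
DeadNodeDetectorSketch, its method names, and the probe callback are all 
hypothetical, and a real detector would probe asynchronously rather than while 
holding a lock.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

/** Illustrative sketch only; every name here is hypothetical. */
class DeadNodeDetectorSketch {
    enum State { LIVE, SUSPICIOUS, DEAD }

    private final Map<String, State> states = new HashMap<>();
    private final Deque<String> suspectQueue = new ArrayDeque<>();
    // Assumed probe: returns true if the DataNode answered.
    private final Predicate<String> probe;

    DeadNodeDetectorSketch(Predicate<String> probe) {
        this.probe = probe;
    }

    /** Step 1: an input stream opened a BlockReader on this DataNode. */
    synchronized void register(String dn) {
        states.putIfAbsent(dn, State.LIVE);
    }

    /** Step 1: an input stream hit an error while reading from this node. */
    synchronized void reportReadError(String dn) {
        State current = states.get(dn);
        if (current == State.DEAD || current == State.SUSPICIOUS) {
            return; // already dead, or already queued for re-probing
        }
        states.put(dn, State.SUSPICIOUS);
        suspectQueue.add(dn);
    }

    /** One detection round; a scheduler would call this periodically. */
    synchronized void runProbeRound() {
        Set<String> probed = new HashSet<>();
        // Step 2: suspicious nodes are re-probed before everything else.
        String dn;
        while ((dn = suspectQueue.poll()) != null) {
            states.put(dn, probe.test(dn) ? State.LIVE : State.DEAD);
            probed.add(dn);
        }
        // Steps 1 and 3: live nodes may turn dead, dead nodes may recover.
        for (Map.Entry<String, State> e : states.entrySet()) {
            if (!probed.contains(e.getKey())) {
                e.setValue(probe.test(e.getKey()) ? State.LIVE : State.DEAD);
            }
        }
    }

    synchronized State stateOf(String dn) {
        return states.get(dn);
    }
}
```

Because all input streams of one client would share a single detector 
instance, a node marked DEAD by one stream can be skipped by the others 
without each of them having to time out on it first.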

Patch HDFS-13571-2.6.diff is quite old, and we have made a lot of updates 
recently. In the future, I can submit the changes separately as subtasks.

[~linyiqun] [~jojochuang], please help review this idea. Thank you again.

!屏幕快照 2019-07-14 下午12.27.22.png!

 

> Dead DataNode Detector
> ----------------------
>
>                 Key: HDFS-13571
>                 URL: https://issues.apache.org/jira/browse/HDFS-13571
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>    Affects Versions: 2.4.0, 2.6.0, 3.0.2
>            Reporter: Gang Xie
>            Assignee: Lisheng Sun
>            Priority: Minor
>         Attachments: HDFS-13571-2.6.diff, node status machine.png
>
>
> Currently, the information about dead datanodes in DFSInputStream is stored 
> locally, so it cannot be shared among the inputstreams of the same 
> DFSClient. In our production env, some datanodes die every day for 
> different reasons. When this happens, after the first inputstream is blocked 
> and detects the dead node, it cannot share this information with the others 
> in the same DFSClient, so the other inputstreams are still blocked by the 
> dead node for some time, which can cause bad service latency.
> To eliminate this impact from dead datanodes, we designed a dead datanode 
> detector, which detects the dead ones in advance and shares this information 
> among all the inputstreams in the same client. This improvement has been 
> online for some months and works fine, so we decided to port it to 3.0 (the 
> versions used in our production env are 2.4 and 2.6).
> I will do the porting work and upload the code later.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

