[jira] [Comment Edited] (HDFS-15809) DeadNodeDetector doesn't remove live nodes from dead node set.

Jinglun (Jira) Thu, 18 Feb 2021 19:24:06 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17286832#comment-17286832
 ]


Jinglun edited comment on HDFS-15809 at 2/19/21, 3:23 AM:
----------------------------------------------------------

Hi [~leosun08], thanks for your comments. The solution in v01 introduces a new 
deduplicated queue that rejects nodes which are already queued. The queue size 
is also not fixed, so all the dead nodes can be added to it. Thus duplicated 
dead nodes can no longer be repeatedly added to the probe queue.

Because the queue itself is deduplicated, we don't need to worry about the 
queue size exploding. Its size is never greater than the number of datanodes.

Shuffling is a good idea and is much simpler. But I think the deduplicated 
approach is more efficient because there are no duplicated probes.

Adjusting the queue size won't fix the problem because the queue accepts 
duplicated nodes. Even if the queue size is 100000, it could still be filled up 
with the first 30 nodes.
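
To make the idea concrete, here is a minimal sketch of a deduplicated FIFO 
queue. It is only an illustration and not the code in HDFS-15809.001.patch: 
the class name DedupProbeQueue and its methods are hypothetical, and it assumes 
a LinkedHashSet is enough to keep insertion order while rejecting duplicates.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;

/** Hypothetical illustration only; not the class used in the v01 patch. */
class DedupProbeQueue<T> {
    // LinkedHashSet keeps insertion (FIFO) order and rejects duplicates,
    // so the size can never exceed the number of distinct nodes offered.
    private final LinkedHashSet<T> nodes = new LinkedHashSet<>();

    /** Queues the node; returns false if it is already waiting to be probed. */
    synchronized boolean offer(T node) {
        return nodes.add(node);
    }

    /** Removes and returns the oldest queued node, or null if the queue is empty. */
    synchronized T poll() {
        Iterator<T> it = nodes.iterator();
        if (!it.hasNext()) {
            return null;
        }
        T head = it.next();
        it.remove();
        return head;
    }

    synchronized int size() {
        return nodes.size();
    }
}
```

Offering the same node twice leaves the queue size at 1, and a node can be 
queued again only after it has been polled for probing.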

 


was (Author: lijinglun):
Hi [~leosun08], thanks for your comments. The solution in v01 is to avoid adding 
duplicated dead nodes to the probe queue. So the queue won't be filled up with 
duplicated dead nodes.

Shuffling is a good idea and is much simpler. I also agree with the shuffle 
approach.

 

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --------------------------------------------------------------
>
>                 Key: HDFS-15809
>                 URL: https://issues.apache.org/jira/browse/HDFS-15809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15809.001.patch
>
>
> We found that the dead node detector might never remove live nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue length is limited to 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as last time, so the 30 nodes that have already been probed are added 
> to the queue again.
>  # Repeat 3 and 4. We always add the first 30 nodes from the dead set. If 
> they are all dead, then the live nodes behind them can never be recovered.
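
The steps above can be reproduced with a small standalone simulation. This is a 
sketch under stated assumptions, not Hadoop code: the class StarvationDemo and 
its simulate() method are hypothetical, node ids are plain integers, every 
probed node is assumed to still be dead (so it stays in the dead set), and 
each scheduling round is modelled as one refill followed by a fixed number of 
probes.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

/** Hypothetical model of the bounded probe queue that accepts duplicates. */
class StarvationDemo {
    /** Returns the ids of all nodes that were probed at least once. */
    static Set<Integer> simulate(int deadNodes, int capacity,
                                 int probesPerRound, int rounds) {
        List<Integer> deadSet = new ArrayList<>();
        for (int i = 0; i < deadNodes; i++) {
            deadSet.add(i);
        }
        Queue<Integer> probeQueue = new ArrayDeque<>();
        Set<Integer> probed = new LinkedHashSet<>();
        for (int r = 0; r < rounds; r++) {
            // Refill step: iterate the dead set in the same order every time
            // and stop as soon as the bounded queue is full.
            for (int node : deadSet) {
                if (probeQueue.size() >= capacity) {
                    break;
                }
                probeQueue.add(node); // duplicates are accepted
            }
            // Probe step: assumed to find every node still dead,
            // so nothing is removed from the dead set.
            for (int p = 0; p < probesPerRound && !probeQueue.isEmpty(); p++) {
                probed.add(probeQueue.poll());
            }
        }
        return probed;
    }
}
```

Calling StarvationDemo.simulate(200, 100, 30, 50) never probes nodes 100 to 
199, no matter how many rounds are run, which matches the behaviour described 
in the steps.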



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

