[ 
https://issues.apache.org/jira/browse/HDFS-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14629236#comment-14629236
 ] 

Ming Ma commented on HDFS-8776:
-------------------------------

Makes sense. Disabling DecommissionManager on the standby might have some 
operational impact, though. Admins usually update 
dfs.namenode.hosts.exclude and then call "dfsadmin -refreshNodes" on both 
active and standby around the same time, so that decomm can continue if the NN 
fails over. If DecommissionManager isn't running on the standby, nodes will 
stay in the decommission_inprogress state there without making any progress. 
As long as admins know to ignore the decommission state on the standby, that 
should be OK (even if we keep DecommissionManager running, the decommission 
states on active and standby could differ at any given time).
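
For illustration only, the admin routine described above could be sketched as 
the script below. The NameNode addresses are hypothetical placeholders, and the 
script assumes the exclude file has already been updated on both NameNodes. It 
uses the generic "-fs" option to point each refreshNodes call at one NameNode, 
and by default it only prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical NameNode RPC addresses; substitute the real active/standby pair.
NAMENODES="hdfs://nn1.example.com:8020 hdfs://nn2.example.com:8020"

# Issue one refreshNodes per NameNode so both re-read the exclude file.
# DRY_RUN defaults to 1, in which case the commands are only printed.
refresh_all() {
  for nn in $NAMENODES; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "hdfs dfsadmin -fs $nn -refreshNodes"
    else
      hdfs dfsadmin -fs "$nn" -refreshNodes
    fi
  done
}

refresh_all
```

Setting DRY_RUN=0 would run the commands for real. With DecommissionManager 
disabled on the standby, the second call would still mark the nodes there, but 
their state would not advance until that NN becomes active.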

> Decom manager should not be active on standby
> ---------------------------------------------
>
>                 Key: HDFS-8776
>                 URL: https://issues.apache.org/jira/browse/HDFS-8776
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.0
>            Reporter: Daryn Sharp
>            Assignee: Daryn Sharp
>
> The decommission manager should not be actively processing on the standby.
> The decomm manager goes through the costly computation of determining that 
> every block on the node requires replication, yet doesn't queue them for 
> replication because it's in standby. The decomm manager holds the namesystem 
> write lock, causing DNs to time out on heartbeats or IBRs. The NN then purges 
> the call queue of timed-out clients and processes some heartbeats/IBRs before 
> the decomm manager locks up the namesystem again. Nodes attempting to 
> register will be sending full BRs, which are more costly to send and discard 
> than a heartbeat.
> If a failover is required, the standby will likely have to struggle very hard 
> to not GC while "catching up" on its queued IBRs while DNs continue to fill 
> the call queue and time out.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
