[
https://issues.apache.org/jira/browse/HDFS-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14702079#comment-14702079
]
Daryn Sharp commented on HDFS-8776:
-----------------------------------
The change we used is simply to skip the costly scan while
{{isPopulatingReplQueues}} is false. On failover, the node state may effectively
flip back to decomm-ing, but it won't take long for the decomm manager to mark
all of the decomm-ing nodes as decomm-ed. I think this is the correct behavior.
In the active's "current" moment, which is the standby's "future" while it has
queued IBRs, the active may consider the node still decomm-ing. It would be
bad for the standby to decide the node is decomm-ed based on stale info. If it
did, then post-failover the standby would be unable to correct under-replication
because decomm-ed nodes are not a valid replication source.
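
To make the behavior described above concrete, here is a minimal sketch of the idea, not the actual patch. Only the {{isPopulatingReplQueues}} condition comes from the comment above; the class, field, and method names (DecomScanSketch, Monitor, tick, underReplicatedBlocks) are hypothetical stand-ins and are not the real Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for illustration; not the real Hadoop decomm manager. */
class DecomScanSketch {
    enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    /** Simplified datanode: admin state plus a count of blocks still needing replication. */
    static class Node {
        final String name;
        AdminState state = AdminState.DECOMMISSION_IN_PROGRESS;
        int underReplicatedBlocks;

        Node(String name, int underReplicatedBlocks) {
            this.name = name;
            this.underReplicatedBlocks = underReplicatedBlocks;
        }
    }

    static class Monitor {
        final List<Node> decommissioning = new ArrayList<>();
        // Assumption: mirrors isPopulatingReplQueues, i.e. false while in standby.
        boolean populatingReplQueues = false;
        // Counts per-node scans so the skip is observable.
        int scansPerformed = 0;

        /** One monitor pass. Returns how many nodes were marked decomm-ed. */
        int tick() {
            if (!populatingReplQueues) {
                // Standby: skip the costly scan and leave nodes decomm-ing,
                // so no decision is made from possibly stale block info.
                return 0;
            }
            int completed = 0;
            for (Node n : decommissioning) {
                if (n.state != AdminState.DECOMMISSION_IN_PROGRESS) {
                    continue;
                }
                scansPerformed++; // stands in for the expensive per-block check
                if (n.underReplicatedBlocks == 0) {
                    n.state = AdminState.DECOMMISSIONED;
                    completed++;
                }
            }
            return completed;
        }
    }

    public static void main(String[] args) {
        Monitor m = new Monitor();
        Node dn = new Node("dn1", 0);
        m.decommissioning.add(dn);

        m.tick();                      // standby: nothing scanned
        System.out.println("standby: " + dn.state + ", scans=" + m.scansPerformed);

        m.populatingReplQueues = true; // simulated failover to active
        m.tick();                      // active: node is scanned and completed
        System.out.println("active:  " + dn.state + ", scans=" + m.scansPerformed);
    }
}
```

In this sketch a node stays decomm-ing for as long as the monitor runs in standby, and is only marked decomm-ed by the first pass after the flag flips, which matches the "flip back to decomm-ing on failover" behavior described above.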
> Decom manager should not be active on standby
> ---------------------------------------------
>
> Key: HDFS-8776
> URL: https://issues.apache.org/jira/browse/HDFS-8776
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 2.6.0
> Reporter: Daryn Sharp
> Assignee: Daryn Sharp
>
> The decommission manager should not be actively processing on the standby.
> The decomm manager goes through the costly computation of determining whether
> each block on the node requires replication, yet doesn't queue them for
> replication because it's in standby. While the decomm manager holds the
> namesystem write lock, DNs time out on heartbeats or IBRs, the NN purges the
> call queue of timed-out clients, and the NN processes some heartbeats/IBRs
> before the decomm manager locks up the namesystem again. Nodes attempting to
> register will be sending full BRs, which are more costly to send and discard
> than a heartbeat. If a failover is required, the standby will likely have to
> struggle very hard to not GC while "catching up" on its queued IBRs while DNs
> continue to fill the call queue and time out.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)