Daryn Sharp created HDFS-8776:
---------------------------------
Summary: Decom manager should not be active on standby
Key: HDFS-8776
URL: https://issues.apache.org/jira/browse/HDFS-8776
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 2.6.0
Reporter: Daryn Sharp
Assignee: Daryn Sharp
The decommission manager should not be actively processing on the standby.
The decomm manager goes through the costly computation of determining whether
every block on the node requires replication, yet doesn't queue them for
replication because it's in standby. The decomm manager holds the namesystem
write lock, causing DNs to time out on heartbeats or IBRs. The NN then purges
the call queue of timed-out clients and processes some heartbeats/IBRs before
the decomm manager locks up the namesystem again. Nodes attempting to register
will be sending full BRs, which are more costly to send and discard than a
heartbeat.
If a failover is required, the standby will likely have to struggle very hard
to not GC while "catching up" on its queued IBRs while DNs continue to fill the
call queue and time out.
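
One possible shape for the guard, as a minimal self-contained sketch only. It
does not use the real HDFS classes: DecomMonitorSketch, isActive, nsLock and
getScans are hypothetical names invented for illustration, and the actual fix
may look quite different. The idea is simply to check the HA state before
taking the write lock, so a standby never pays for the scan.

```java
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names are the real HDFS classes. */
public class DecomMonitorSketch implements Runnable {
    // Assumed probe for the HA state: true only while this NN is active.
    private final BooleanSupplier isActive;
    // Stands in for the namesystem lock described in the report.
    private final ReentrantReadWriteLock nsLock;
    // Nodes currently being decommissioned (illustrative data only).
    private final List<String> decommissioningNodes;
    // Counts how many per-node scans were actually performed.
    private int scans = 0;

    public DecomMonitorSketch(BooleanSupplier isActive,
                              ReentrantReadWriteLock nsLock,
                              List<String> decommissioningNodes) {
        this.isActive = isActive;
        this.nsLock = nsLock;
        this.decommissioningNodes = decommissioningNodes;
    }

    @Override
    public void run() {
        // On the standby, return before touching the write lock so that
        // heartbeats and IBRs are not blocked by a scan whose results
        // would be discarded anyway.
        if (!isActive.getAsBoolean()) {
            return;
        }
        nsLock.writeLock().lock();
        try {
            for (String node : decommissioningNodes) {
                // Placeholder for the costly per-block replication check.
                scans++;
            }
        } finally {
            nsLock.writeLock().unlock();
        }
    }

    public int getScans() {
        return scans;
    }
}
```

With the probe returning false, run() returns immediately and the lock is
never acquired; once the probe returns true (for example after a failover),
the next run() performs the scan as before.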
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)