KevinWikant edited a comment on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760


   > DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission 
has effectively failed. There is a case where it can complete, but what does 
that really mean - if the node is dead, it has not been gracefully stopped.
   
   The case I have described, where dead node decommissioning completes, can occur when:
   - a decommissioning node goes dead, but all of its blocks still have block 
replicas on other live nodes
   - the namenode is eventually able to satisfy the minimum replication of all 
blocks (by replicating the under-replicated blocks from the live nodes)
   - the dead decommissioning node is transitioned to decommissioned
   
   In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to redundant block replicas. From the user's perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned & there are no LowRedundancy blocks.
   
   If there are LowRedundancy blocks, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately transition to decommissioned anyway, so why not make it decommissioned while it's still dead?
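   
   To make that decision concrete, here is a minimal sketch (the types and helper below are hypothetical simplifications for illustration, not the actual Hadoop classes):
   
   ```java
   // Hypothetical, simplified types for illustration; not the real Hadoop classes.
   enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }
   
   class DeadNodeSketch {
       AdminState adminState = AdminState.DECOMMISSION_IN_PROGRESS;
       boolean alive = false;          // the datanode is dead
       int lowRedundancyBlocks = 0;    // blocks below minimum replication
   
       // Proposed behavior: a dead DECOMMISSION_IN_PROGRESS node with no
       // LowRedundancy blocks is moved straight to DECOMMISSIONED, because
       // that is exactly what would happen the moment it re-registered.
       void checkDeadDecommissioningNode() {
           if (!alive && adminState == AdminState.DECOMMISSION_IN_PROGRESS
                   && lowRedundancyBlocks == 0) {
               adminState = AdminState.DECOMMISSIONED;
           }
           // Otherwise keep the node DECOMMISSION_IN_PROGRESS: if it comes
           // back alive, it may still recover the LowRedundancy blocks.
       }
   }
   ```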
   
   Also, I don't think the priority queue adds much complexity: it just orders healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), so that healthy nodes are decommissioned first.
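   
   For illustration, the ordering could be expressed with a comparator like this (a sketch assuming a lastHeartbeatTime field; the actual field names and queue wiring in this PR may differ):
   
   ```java
   import java.util.PriorityQueue;
   
   // Hypothetical stand-in for DatanodeDescriptor, reduced to what matters here.
   record PendingNode(String name, long lastHeartbeatTime) {}
   
   public class PendingNodesQueue {
       // More recent heartbeat first, so healthy nodes are polled (and hence
       // decommissioned) before stale/unhealthy ones.
       private final PriorityQueue<PendingNode> pendingNodes = new PriorityQueue<>(
           (a, b) -> Long.compare(b.lastHeartbeatTime(), a.lastHeartbeatTime()));
   
       public void add(PendingNode node) { pendingNodes.add(node); }
       public PendingNode poll() { return pendingNodes.poll(); }
   }
   ```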
   
   ----
   
   I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing.
   
   If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from 
DatanodeAdminManager, then the following callstack should re-add the datanode 
to the DatanodeAdminManager when it comes alive again:
   - 
[DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
   - 
[DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
   - 
[DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
   - 
[DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)
   
   The problem is this condition "!node.isDecommissionInProgress()": 
https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177
   
   Because the dead datanode is left in DECOMMISSION_IN_PROGRESS, the "!node.isDecommissionInProgress()" condition prevents "startTrackingNode" from being invoked.
   
   Simply removing the condition "!node.isDecommissionInProgress()" will not work, because "startTrackingNode" is not idempotent:
   - [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
   - [pendingNodes is an ArrayDeque, which does not deduplicate DatanodeDescriptors](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
   - therefore, removing the "!node.isDecommissionInProgress()" check would cause a large number of duplicate DatanodeDescriptors to be added to the DatanodeAdminManager
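   
   The duplication is easy to demonstrate, since ArrayDeque has no set semantics (a standalone sketch, not Hadoop code):
   
   ```java
   import java.util.ArrayDeque;
   import java.util.Queue;
   
   public class DuplicateEnqueueDemo {
       public static void main(String[] args) {
           Queue<String> pendingNodes = new ArrayDeque<>();
           String node = "datanode-1"; // stands in for a DatanodeDescriptor
   
           // Simulate refreshDatanodes invoking startDecommission repeatedly
           // with the !node.isDecommissionInProgress() guard removed.
           for (int i = 0; i < 5; i++) {
               pendingNodes.add(node); // ArrayDeque does not deduplicate
           }
           System.out.println("queued copies: " + pendingNodes.size()); // 5
       }
   }
   ```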
   
   I can think of 2 obvious ways to handle this:
   A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager again)
   B) modify "startDecommission" so that it knows whether the invocation is for a datanode which was just restarted after being dead, so that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true
   
   For "A)", the challenge is that we need to ensure the DatanodeDescriptor is 
not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly 
costly call to execute repeatedly. Also, I am not even sure such a check is 
thread-safe given there is no locking used as part of "startDecommission" or 
"startTrackingNode"
   
   For "B)", the awareness of if a registerDatanode call is related to a 
restarted datanode is available 
[here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177).
 So this information would need to be passed down the callstack to a method 
"startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of 
the modified method signature, all the other invocations of "startDecommission" 
will need to specify isNodeRestart=false
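   
   A hedged sketch of "B)" (NodeView stands in for DatanodeDescriptor, and the isNodeRestart plumbing is simplified):
   
   ```java
   // Hedged sketch of option B; NodeView stands in for DatanodeDescriptor
   // and the isNodeRestart plumbing is simplified.
   interface NodeView {
       boolean isDecommissionInProgress();
       boolean isDecommissioned();
   }
   
   class RestartAwareSketch {
       // The restart flag lets a re-registering dead node be re-tracked even
       // though it is already DECOMMISSION_IN_PROGRESS.
       void startDecommission(NodeView node, boolean isNodeRestart) {
           if (isNodeRestart
                   || (!node.isDecommissionInProgress() && !node.isDecommissioned())) {
               startTrackingNode(node);
           }
       }
   
       void startTrackingNode(NodeView node) {
           // enqueue the node for the DatanodeAdminMonitor
       }
   }
   ```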
   
   Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change?
   
   ----
   
   In short:
   - I don't think there is any downside to moving a dead datanode to decommissioned when there are no LowRedundancy blocks, because this would happen immediately anyway were the node to come back alive (and get re-added to the DatanodeAdminManager)
   - the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in the DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove dead datanodes from the DatanodeAdminManager, I can implement proposal "B)".

