KevinWikant edited a comment on pull request #3675: URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760
> DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.

The case I have described, where dead node decommissioning completes, can occur when:
- a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
- the Namenode is eventually able to satisfy the minimum replication of all blocks (by re-replicating the under-replicated blocks from the live nodes)
- the dead decommissioning node is transitioned to decommissioned

In this case, the node did go dead while decommissioning, but there were no LowRedundancy blocks thanks to the redundant block replicas. From the user's perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would be the same: the node is decommissioned & there are no LowRedundancy blocks.

If there are LowRedundancy blocks, then a dead datanode will remain DECOMMISSION_IN_PROGRESS, because if the dead node were to come alive again it may be able to recover the LowRedundancy blocks. But if there are no LowRedundancy blocks, then when the node comes alive again it will immediately be transitioned to decommissioned anyway, so why not make it decommissioned while it's still dead?

Also, I don't think the priority queue is adding much complexity: it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times), such that healthy nodes are decommissioned first. A minimal sketch of that ordering is shown below.
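For illustration only, here is a minimal sketch of the heartbeat-based ordering; the `Node` class, field names, and timestamps are hypothetical stand-ins for `DatanodeDescriptor` and its heartbeat state, not the actual patch:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class PendingNodeOrderSketch {

  // Minimal stand-in for DatanodeDescriptor; only the last-heartbeat
  // timestamp matters for the ordering being illustrated.
  static class Node {
    final String name;
    final long lastHeartbeatMs; // cf. DatanodeInfo#getLastUpdateMonotonic()

    Node(String name, long lastHeartbeatMs) {
      this.name = name;
      this.lastHeartbeatMs = lastHeartbeatMs;
    }
  }

  public static void main(String[] args) {
    // Most recent heartbeat first: healthy nodes are polled for
    // decommission tracking ahead of stale (likely dead) nodes.
    PriorityQueue<Node> pendingNodes = new PriorityQueue<>(
        Comparator.comparingLong((Node n) -> n.lastHeartbeatMs).reversed());

    pendingNodes.add(new Node("dn-stale", 1_000L));   // old heartbeat
    pendingNodes.add(new Node("dn-healthy", 9_000L)); // recent heartbeat

    // Prints dn-healthy, then dn-stale.
    while (!pendingNodes.isEmpty()) {
      System.out.println(pendingNodes.poll().name);
    }
  }
}
```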
----

I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing.

If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from the DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
- [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
- [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
- [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
- [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)

The problem is this condition, "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177

Because the dead datanode is left in DECOMMISSION_IN_PROGRESS, "startTrackingNode" is not invoked, due to the "!node.isDecommissionInProgress()" condition.

Simply removing the "!node.isDecommissionInProgress()" condition will not function well, because "startTrackingNode" is not idempotent:
- [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
- [pendingNodes is an ArrayDeque which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
- therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptors to be added to the DatanodeAdminManager

I can think of 3 obvious ways to handle this:
- A) make calls to "startTrackingNode" idempotent, meaning that if the DatanodeDescriptor is already tracked it does not get added to the DatanodeAdminManager again
- B) modify "startDecommission" such that it is aware of whether the invocation is for a datanode which was just restarted after being dead, so that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true
- C) add a flag to DatanodeDescriptor which indicates whether the datanode is already being tracked within the DatanodeAdminManager, then check this flag in "startDecommission"

For A), the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks", which could be a fairly costly check to execute repeatedly. Also, I am not even sure such a check is thread-safe, given there is no locking used as part of "startDecommission" or "startTrackingNode".

For B), the awareness of whether a registerDatanode call is related to a restarted datanode is available [here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177). This information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)"; because of the modified method signature, all the other invocations of "startDecommission" would need to specify isNodeRestart=false.

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex than the proposed change? A rough sketch of what C) could look like is shown below.
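For concreteness, here is a rough sketch of option C), under the assumption that a single atomic flag per descriptor is acceptable; the class, field, and method names are illustrative stand-ins, not the actual DatanodeAdminManager code:

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicBoolean;

public class TrackedFlagSketch {

  // Stand-in for DatanodeDescriptor, extended with the hypothetical flag.
  static class Node {
    final String uuid;
    // Hypothetical flag: set once the node is handed to the admin monitor.
    final AtomicBoolean trackedByAdminManager = new AtomicBoolean(false);

    Node(String uuid) {
      this.uuid = uuid;
    }
  }

  // Mirrors pendingNodes, which as a plain ArrayDeque does not deduplicate.
  private final Queue<Node> pendingNodes = new ArrayDeque<>();

  void startTrackingNode(Node node) {
    // compareAndSet turns a repeated startDecommission into a no-op
    // instead of enqueuing a duplicate descriptor.
    if (node.trackedByAdminManager.compareAndSet(false, true)) {
      pendingNodes.add(node);
    }
  }

  void stopTrackingNode(Node node) {
    // Clearing the flag lets a later re-registration enqueue the node again.
    node.trackedByAdminManager.set(false);
    pendingNodes.remove(node);
  }

  public static void main(String[] args) {
    TrackedFlagSketch monitor = new TrackedFlagSketch();
    Node node = new Node("dn-1");
    monitor.startTrackingNode(node);
    monitor.startTrackingNode(node); // duplicate call: no second enqueue
    System.out.println(monitor.pendingNodes.size()); // prints 1
  }
}
```

Since the compare-and-set is atomic, this sketch would also sidestep the thread-safety concern raised for A), without any costly scan of the monitor's internal collections.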
----

In short:
- I don't think there is any downside to moving a dead datanode to decommissioned when there are no LowRedundancy blocks, because this would happen immediately anyway were the node to come back alive (and get re-added to the DatanodeAdminManager)
- the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code

@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in the DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from the DatanodeAdminManager, I can implement proposal C).
