KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760


   > DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission 
has effectively failed. There is a case where it can complete, but what does 
that really mean - if the node is dead, it has not been gracefully stopped.
   
   The case which I have described where dead node decommissioning completes 
can occur when:
   - a decommissioning node goes dead, but all of its blocks still have block 
replicas on other live nodes
   - the namenode is eventually able to satisfy the minimum replication of all 
blocks (by replicating the under-replicated blocks from the live nodes)
   - the dead decommissioning node is transitioned to decommissioned
   
   In this case, the node did go dead while decommissioning, but there was no data loss thanks to redundant block replicas. From the user's perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned, there is no data loss, & all blocks have sufficient replicas.
   
   If there is data loss then a dead datanode will remain DECOMMISSION_IN_PROGRESS, because if the dead node were to come alive again it may be able to recover the lost data. But if there is no data loss, then when the node comes alive again it will immediately be transitioned to decommissioned anyway, so why not make it decommissioned while it's still dead (given there is no data loss)?
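
   To make that concrete, here is a rough sketch of the intended behavior (not the actual patch; helper names like `isAlive`, `hasLowRedundancyBlocks`, and `setDecommissioned` are used illustratively):

```java
// Rough sketch only: how a dead DECOMMISSION_IN_PROGRESS node could be handled
// inside the DatanodeAdminManager monitor scan. Helper names are illustrative.
void checkDeadDecommissioningNode(DatanodeDescriptor node) {
  if (!node.isAlive() && node.isDecommissionInProgress()
      && !hasLowRedundancyBlocks(node)) {
    // Every block on the dead node is sufficiently replicated elsewhere, so
    // the outcome matches what would happen if the node came back alive:
    // transition it to DECOMMISSIONED instead of leaving it stuck.
    node.setDecommissioned();
  }
  // Otherwise keep it DECOMMISSION_IN_PROGRESS: if the dead node comes back,
  // it may still recover blocks that currently have no live replica.
}
```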
   
   Also, I don't think the priority queue adds much complexity; it just puts healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times) so that healthy nodes are decommissioned first.
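
   A minimal illustration of that ordering (the `NodeInfo` class is a stand-in for `DatanodeDescriptor`, not the implementation in the PR):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for DatanodeDescriptor: only the last-heartbeat
// timestamp matters for the ordering described above.
class NodeInfo {
  final String name;
  final long lastUpdateMonotonic; // most recent heartbeat time, in ms

  NodeInfo(String name, long lastUpdateMonotonic) {
    this.name = name;
    this.lastUpdateMonotonic = lastUpdateMonotonic;
  }
}

public class PendingNodesOrderingDemo {
  public static void main(String[] args) {
    // Order by most recent heartbeat first, so healthy nodes are polled (and
    // therefore start decommissioning) ahead of stale/unhealthy nodes.
    PriorityQueue<NodeInfo> pendingNodes = new PriorityQueue<>(
        Comparator.comparingLong((NodeInfo n) -> n.lastUpdateMonotonic).reversed());
    pendingNodes.add(new NodeInfo("dn-healthy", 1_000_000L));
    pendingNodes.add(new NodeInfo("dn-stale", 10_000L));
    System.out.println(pendingNodes.poll().name); // prints dn-healthy
  }
}
```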
   
   ----
   
   I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing.
   
   If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from 
DatanodeAdminManager, then the following callstack should re-add the datanode 
to the DatanodeAdminManager when it comes alive again:
   - 
[DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
   - 
[DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
   - 
[DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
   - 
[DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)
   
   The problem is this condition "!node.isDecommissionInProgress()": 
https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177
   
   Because the dead datanode is left in DECOMMISSION_IN_PROGRESS, the "!node.isDecommissionInProgress()" condition prevents "startTrackingNode" from being invoked.
   
   Simply removing the condition "!node.isDecommissionInProgress()" will not 
function well because "startTrackingNode" is not idempotent:
   - [startDecommission is invoked periodically when refreshDatanodes is 
called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
   - [pendingNodes is an ArrayDeque which does not deduplicate DatanodeDescriptors](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
   - therefore, removing the "!node.isDecommissionInProgress()" check would cause a large number of duplicate DatanodeDescriptors to be added to the DatanodeAdminManager (see the short demonstration below)
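
   The duplication problem boils down to the queue having no set semantics (a trivial standalone demonstration, not Hadoop code):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class PendingNodesDuplicationDemo {
  public static void main(String[] args) {
    // Stand-in for the pendingNodes queue of DatanodeDescriptors.
    Queue<String> pendingNodes = new ArrayDeque<>();
    String node = "dn-1";
    // Each refreshDatanodes cycle re-invokes startDecommission; without the
    // "!node.isDecommissionInProgress()" guard the same node is enqueued again.
    pendingNodes.add(node);
    pendingNodes.add(node);
    System.out.println(pendingNodes.size()); // prints 2: no deduplication
  }
}
```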
   
   I can think of 2 obvious ways to handle this:
   A) make calls to "startTrackingNode" idempotent (meaning that if the 
DatanodeDescriptor is already tracked, it does not get added to the 
DatanodeAdminManager)
   B) modify startDecommission such that it is aware of whether the invocation is for a datanode which was just restarted after being dead, so that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true
   
   For "A)", the challenge is that we need to ensure the DatanodeDescriptor is 
not in "pendingReplication" or "outOfServiceBlocks" which could be a fairly 
costly call to execute repeatedly. Also, I am not even sure such a check is 
thread-safe given there is no locking used as part of "startDecommission" or 
"startTrackingNode"
   
   For "B)", the awareness of if a registerDatanode call is related to a 
[restarted datanode is available 
here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177).
 So this information would need to be passed down the callstack to a method 
"startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of 
the modified method signature, all the other invocations of "startDecommission" 
will need to specify isNodeRestart=false
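
   A rough sketch of "B)" (again illustrative, not a final implementation):

```java
// Illustrative sketch of "B)": thread the restart information down the call
// stack via a widened signature.
public void startDecommission(DatanodeDescriptor node, boolean isNodeRestart) {
  if (!node.isDecommissionInProgress() && !node.isDecommissioned()) {
    hbManager.startDecommission(node);
    if (node.isDecommissionInProgress()) {
      monitor.startTrackingNode(node);
    }
  } else if (isNodeRestart && node.isDecommissionInProgress()) {
    // The node was dead and left in DECOMMISSION_IN_PROGRESS; it has just
    // re-registered, so resume tracking it even though its admin state
    // never changed.
    monitor.startTrackingNode(node);
  }
}

// All other existing callers keep the old behavior:
public void startDecommission(DatanodeDescriptor node) {
  startDecommission(node, false);
}
```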
   
   Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex/impactful than the proposed change?
   
   ----
   
   In short:
   - I don't think there is any downside in moving a dead datanode to 
decommissioned when there are no LowRedundancy blocks because this would happen 
immediately anyway were the node to come back alive (and get re-added to 
DatanodeAdminManager)
   - the approach of removing a dead datanode from the DatanodeAdminManager 
will not work properly without some significant refactoring of the 
"startDecommission" method & related code
   
   @sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in the DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove dead datanodes from the DatanodeAdminManager, I can implement proposal "B)".

