KevinWikant commented on PR #4410: URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1154627715
Hi @ashutoshcipher, thank you for reviewing the PR! I have tried to address your comments below:

## I am thinking could there be a case where the block can show up in both liveReplicas and decommissioning, which could lead to an unnecessary call to invalidateCorruptReplicas()

I am not sure it's possible for a block replica to be reported as both a liveReplica & a decommissioningReplica at the same time. It's my understanding that a block replica on a decommissioning datanode is counted as a decommissioningReplica, while a non-corrupt replica on a live datanode is counted as a liveReplica. So a block replica on a decommissioning node will only be counted towards the decommissioningReplica count & not the liveReplica count. I have never seen the behavior you are mentioning in my experience; let me know if you have a reference JIRA.

If the case you described is possible, then in theory numUsableReplica would be greater than it should be. In the typical case where "dfs.namenode.replication.min = 1" this makes no difference, because even if there is only 1 non-corrupt block replica on 1 decommissioning node, the corrupt blocks are invalidated regardless of whether numUsableReplica=1 or numUsableReplica=2 (due to double counting of the replica as both liveReplica & decommissioningReplica). In the case where "dfs.namenode.replication.min > 1" there could arguably be a difference, because the corrupt replicas would not be invalidated if numUsableReplica=1 but would be invalidated if numUsableReplica=2 (due to the same double counting). I think if this scenario is possible, the correct fix would be to ensure that each block replica is only counted once, towards either liveReplica or decommissioningReplica. Let me know if there is a JIRA for this issue & I can potentially look into the bug fix separately.

## Edge case coming to my mind is when we are considering the block on a decommissioning node as usable but the very next moment, the node is decommissioned

Fair point, I had considered this & mentioned this edge case in the PR description:

```
The only perceived risk here would be that the corrupt blocks are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below the "dfs.namenode.replication.min" (i.e. to 0 replicas in the worst case). This is however not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until there is a sufficient number of liveReplicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.
```

Any replica on a decommissioning node will not be decommissioned until there are enough liveReplicas to satisfy the replication factor for the HDFS file. So it's only possible for decommissioningReplicas to be decommissioned at the same time corruptReplicas are invalidated if there are sufficient liveReplicas to satisfy the replication factor; because of those live replicas it's safe to eliminate both the decommissioningReplicas & the corruptReplicas. If there is not a sufficient number of liveReplicas, then the decommissioningReplicas will not be decommissioned but the corruptReplicas will be invalidated; the block will not be lost because the decommissioningReplicas will still exist.
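To make the counting argument above concrete, here is a minimal, self-contained sketch of the check being discussed. Everything here is a hypothetical stand-in: `ReplicaCounts`, its fields, `shouldInvalidateCorruptReplicas()`, and `minReplication` are illustrative names, not the actual `BlockManager` / `NumberReplicas` code in HDFS.

```java
/**
 * Hypothetical sketch of the usable-replica check discussed above; this is
 * NOT the actual HDFS BlockManager code, just an illustration of the logic.
 */
public class UsableReplicaSketch {

    /** Hypothetical per-block replica tally. */
    static final class ReplicaCounts {
        final int liveReplicas;
        final int decommissioningReplicas;
        final int corruptReplicas;

        ReplicaCounts(int live, int decommissioning, int corrupt) {
            this.liveReplicas = live;
            this.decommissioningReplicas = decommissioning;
            this.corruptReplicas = corrupt;
        }
    }

    /**
     * With this PR's behavior, replicas on decommissioning nodes count
     * towards numUsableReplicas, so corrupt replicas can be invalidated
     * even when liveReplicas alone is below dfs.namenode.replication.min
     * (represented here by the minReplication parameter).
     */
    static boolean shouldInvalidateCorruptReplicas(ReplicaCounts counts,
                                                   int minReplication) {
        int numUsableReplicas =
            counts.liveReplicas + counts.decommissioningReplicas;
        return counts.corruptReplicas > 0
            && numUsableReplicas >= minReplication;
    }

    public static void main(String[] args) {
        // dfs.namenode.replication.min = 1 (the typical case): one
        // non-corrupt replica on a decommissioning node is enough to
        // invalidate the corrupt replicas, so double counting a replica
        // as both live & decommissioning would make no difference here.
        ReplicaCounts oneDecommissioning = new ReplicaCounts(0, 1, 2);
        System.out.println(
            shouldInvalidateCorruptReplicas(oneDecommissioning, 1)); // true

        // dfs.namenode.replication.min = 2: here double counting a single
        // replica (numUsableReplicas=2 instead of 1) would flip the result,
        // which is why each replica should only be counted once.
        System.out.println(
            shouldInvalidateCorruptReplicas(oneDecommissioning, 2)); // false
    }
}
```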
One case you could argue is that if:
- corruptReplicas > 1
- decommissioningReplicas = 1

By invalidating the corruptReplicas we are exposing the block to a risk of loss should the decommissioningReplica be unexpectedly terminated/fail. This is true, but the same risk already exists where:
- corruptReplicas > 1
- liveReplicas = 1

By invalidating the corruptReplicas we are exposing the block to a risk of loss should the liveReplica be unexpectedly terminated/fail. So I don't think this change introduces any new risk of data loss; in fact, it helps improve block redundancy in scenarios where a block cannot be sufficiently replicated because the corruptReplicas cannot be invalidated.
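In terms of the hypothetical sketch above, the symmetry between the two scenarios looks like this; again, these names are illustrative stand-ins, not the real HDFS code:

```java
/**
 * Continuation of the hypothetical UsableReplicaSketch above, with
 * dfs.namenode.replication.min = 1.
 */
public class RiskSymmetryDemo {
    public static void main(String[] args) {
        // corruptReplicas > 1, decommissioningReplicas = 1, liveReplicas = 0
        UsableReplicaSketch.ReplicaCounts loneDecommissioning =
            new UsableReplicaSketch.ReplicaCounts(0, 1, 2);
        // corruptReplicas > 1, liveReplicas = 1, decommissioningReplicas = 0
        UsableReplicaSketch.ReplicaCounts loneLive =
            new UsableReplicaSketch.ReplicaCounts(1, 0, 2);

        // Both print true: numUsableReplicas = 1 >= minReplication = 1, so
        // the corrupt replicas are invalidated in either scenario and the
        // block's survival rests on a single remaining replica either way.
        System.out.println(UsableReplicaSketch
            .shouldInvalidateCorruptReplicas(loneDecommissioning, 1));
        System.out.println(UsableReplicaSketch
            .shouldInvalidateCorruptReplicas(loneLive, 1));
    }
}
```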
