KevinWikant commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1154627715

   Hi @ashutoshcipher, thank you for reviewing the PR! I have tried to address 
your comments below:
   
   ## I am thinking could there be a case where the block can show up in both 
liveReplicas and decommissioning, which could lead to an unnecessary call to 
invalidateCorruptReplicas()
   
   I am not sure it's possible for a block replica to be reported as both a 
liveReplica & a decommissioningReplica at the same time. It's my understanding 
that a non-corrupt block replica on a decommissioning datanode is counted as a 
decommissioningReplica & a non-corrupt replica on a live datanode is counted as 
a liveReplica. So a block replica on a decommissioning node will only be 
counted towards the decommissioningReplica count & not the liveReplica count. I 
have never seen the behavior you are describing in my experience; let me know 
if you have a reference JIRA.
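   To illustrate how I understand the counting, here is a toy sketch (my own illustration, not the actual BlockManager code; the class & method names are made up) showing that each replica lands in exactly one bucket:
   
   ```java
   /** Toy model (not Hadoop code) of bucketing each block replica exactly once. */
   public class ReplicaCountingSketch {
     enum NodeState { IN_SERVICE, DECOMMISSION_IN_PROGRESS, ENTERING_MAINTENANCE }
   
     // A replica lands in exactly one bucket, so it cannot be both live & decommissioning.
     static String bucketFor(boolean replicaIsCorrupt, NodeState nodeState) {
       if (replicaIsCorrupt) {
         return "corrupt";
       }
       switch (nodeState) {
         case DECOMMISSION_IN_PROGRESS: return "decommissioning"; // counted here ...
         case ENTERING_MAINTENANCE:     return "enteringMaintenance";
         default:                       return "live";            // ... not here
       }
     }
   
     public static void main(String[] args) {
       // A non-corrupt replica on a decommissioning node only increments decommissioningReplicas.
       System.out.println(bucketFor(false, NodeState.DECOMMISSION_IN_PROGRESS)); // decommissioning
       // A non-corrupt replica on an in-service node only increments liveReplicas.
       System.out.println(bucketFor(false, NodeState.IN_SERVICE));               // live
     }
   }
   ```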
   
   If the case you described is possible then in theory numUsableReplica would 
be greater than it should be. In the typical case where 
"dfs.namenode.replication.min = 1" this makes no difference: even if there is 
only 1 non-corrupt block replica on 1 decommissioning node, the corrupt blocks 
are invalidated regardless of whether numUsableReplica=1 or numUsableReplica=2 
(due to double counting of the replica as both a liveReplica & a 
decommissioningReplica). In the case where "dfs.namenode.replication.min > 1" 
there could arguably be a difference, because the corrupt replicas would not be 
invalidated if numUsableReplica=1 but they would be invalidated if 
numUsableReplica=2 (again due to the double counting).
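   To make the comparison above concrete, here is a small sketch (my own illustration of the threshold check, not the patch itself; the names are hypothetical) showing that the hypothetical double counting only changes the outcome when "dfs.namenode.replication.min > 1":
   
   ```java
   /** Toy illustration (not Hadoop code) of the numUsableReplica threshold check. */
   public class UsableReplicaSketch {
   
     /** Corrupt replicas are only invalidated once there are enough usable replicas. */
     static boolean shouldInvalidateCorruptReplicas(int numUsableReplica, int minReplication) {
       return numUsableReplica >= minReplication;
     }
   
     public static void main(String[] args) {
       int countedOnce = 1;    // 1 replica on a decommissioning node, counted correctly
       int doubleCounted = 2;  // the same replica hypothetically counted as live + decommissioning
   
       // dfs.namenode.replication.min = 1: the decision is the same either way.
       System.out.println(shouldInvalidateCorruptReplicas(countedOnce, 1));   // true
       System.out.println(shouldInvalidateCorruptReplicas(doubleCounted, 1)); // true
   
       // dfs.namenode.replication.min = 2: the double counting would flip the decision.
       System.out.println(shouldInvalidateCorruptReplicas(countedOnce, 2));   // false
       System.out.println(shouldInvalidateCorruptReplicas(doubleCounted, 2)); // true
     }
   }
   ```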
   
   I think if this scenario is possible the correct fix would be to ensure that 
each block replica is only counted once, towards either liveReplica or 
decommissioningReplica. Let me know if there is a JIRA for this issue & I can 
potentially look into that bug fix separately.
   
   ## Edge case coming to my mind is when we are considering the block on a 
decommissioning node as usable but the very next moment, the node is 
decommissioned.
   
   Fair point; I had considered this edge case & mentioned it in the PR 
description:
   
   ```
   The only perceived risk here would be that the corrupt blocks are 
invalidated at around the same time the decommissioning and entering 
maintenance nodes are decommissioned. This could in theory bring the overall 
number of replicas below the "dfs.namenode.replication.min" (i.e. to 0 replicas 
in the worst case). This is however not an actual risk because the 
decommissioning and entering maintenance nodes will not finish decommissioning 
until there is a sufficient number of liveReplicas; so there is no possibility 
that the decommissioning and entering maintenance nodes will be decommissioned 
prematurely.
   ```
   
   Replicas on decommissioning nodes will not be decommissioned until there are 
enough liveReplicas to satisfy the replication factor for the HDFS file. So 
it's only possible for decommissioningReplicas to be decommissioned at the same 
time corruptReplicas are invalidated if there are sufficient liveReplicas to 
satisfy the replication factor; because of those live replicas it is safe to 
eliminate both the decommissioningReplicas & the corruptReplicas. If there is 
not a sufficient number of liveReplicas then the decommissioningReplicas will 
not be decommissioned but the corruptReplicas will be invalidated; the block 
will not be lost because the decommissioningReplicas will still exist.
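   As a toy model of that invariant (again my own illustration, not the actual decommissioning code; the names are made up), the race is safe because decommissioning only completes once the live replicas alone satisfy the replication factor:
   
   ```java
   /** Toy model (not Hadoop code) of why the decommission/invalidate race is safe. */
   public class DecommissionSafetySketch {
   
     /** A decommissioning node is only released once the block has enough live replicas. */
     static boolean canFinishDecommission(int liveReplicas, int expectedReplication) {
       return liveReplicas >= expectedReplication;
     }
   
     public static void main(String[] args) {
       int expectedReplication = 3;
   
       // Only 1 live replica: the decommissioning replicas are retained, so invalidating
       // the corrupt replicas cannot leave the block with zero copies.
       System.out.println(canFinishDecommission(1, expectedReplication)); // false
   
       // Enough live replicas: both the decommissioning replicas & the corrupt replicas
       // can be removed safely because the live replicas satisfy the replication factor.
       System.out.println(canFinishDecommission(3, expectedReplication)); // true
     }
   }
   ```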
   
   One case you could argue is if:
   - corruptReplicas > 1
   - decommissioningReplicas = 1
   
   By invalidating the corruptReplicas we are exposing the block to a risk of 
loss should the decommissioningReplica unexpectedly terminate/fail. This is 
true, but this same risk already exists where:
   - corruptReplicas > 1
   - liveReplicas = 1
   
   By invalidating the corruptReplicas we are exposing the block to a risk of 
loss should the liveReplica unexpectedly terminate/fail.
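   For concreteness, here is a small worked example (my own illustration, not Hadoop code) showing that both cases leave the block on exactly one replica after the corrupt replicas are invalidated:
   
   ```java
   /** Toy worked example: the new case & the pre-existing case carry the same exposure. */
   public class RiskComparisonSketch {
   
     static int usableAfterInvalidatingCorrupt(int liveReplicas, int decommissioningReplicas) {
       // The corrupt replicas are removed; only the live & decommissioning copies remain.
       return liveReplicas + decommissioningReplicas;
     }
   
     public static void main(String[] args) {
       // Case enabled by this change: corruptReplicas > 1, decommissioningReplicas = 1.
       System.out.println(usableAfterInvalidatingCorrupt(0, 1)); // 1
   
       // Pre-existing case: corruptReplicas > 1, liveReplicas = 1.
       System.out.println(usableAfterInvalidatingCorrupt(1, 0)); // 1
   
       // Either way the block survives on exactly one replica, so the exposure is identical.
     }
   }
   ```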
   
   So I don't think this change introduces any new risk of data loss; in fact, 
it helps improve block redundancy in scenarios where a block cannot be 
sufficiently replicated because the corruptReplicas cannot be invalidated.

