KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410

   HDFS-16064. Determine when to invalidate corrupt replicas based on number of 
usable replicas
   
   ### Description of PR
   
   Bug fix for a recurring HDFS bug which can result in datanodes being unable to complete decommissioning indefinitely. In short, the bug is a chicken & egg problem where:
   - in order for a datanode to be decommissioned its blocks must be 
sufficiently replicated
   - datanode cannot sufficiently replicate some block(s) because of corrupt 
block replicas on target datanodes
   - corrupt block replicas will not be invalidated because the block(s) are 
not sufficiently replicated
   
   In this scenario the block(s) actually are sufficiently replicated (once decommissioning and entering-maintenance replicas are taken into account), but the logic the Namenode uses to determine whether a block is sufficiently replicated is flawed.
   
   To understand the bug further we must first establish some background 
information.
   
   #### Background Information
   
   Givens:
   - FSDataOutputStream is being used to write the HDFS file; under the hood this uses the DataStreamer class
   - for the sake of example we will say the HDFS file has a replication factor 
of 2, though this is not a requirement to reproduce the issue
   - the file is being appended to intermittently over an extended period of time (in general, this issue needs minutes/hours to reproduce)
   - HDFS is configured with typical default configurations
   
   Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case the DataStreamer client can shift the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:
   - DataStreamer client creates block pipeline on datanodes A & B, each have a 
block replica with generation stamp 1
   - DataStreamer client tries to append to the block by sending a block transfer (with generation stamp 2) to datanode A
   - Datanode A succeeds in writing the block transfer & then attempts to 
forward the transfer to datanode B
   - Datanode B fails the transfer for some reason and responds with a 
PipelineAck failure code
   - Datanode A sends a PipelineAck to DataStreamer indicating datanode A 
succeeded in the append & datanode B failed in the append. The DataStreamer 
detects datanode B as a bad link which will be replaced before the next append 
operation
   - at this point datanode A has live replica with generation stamp 2 & 
datanode B has corrupt replica with generation stamp 1
   - the next time DataStreamer tries to append to the block it will call the Namenode "getAdditionalDatanode" API, which returns some other datanode C
   - DataStreamer sends data transfer (with generation stamp 3) to the new 
block pipeline containing datanodes A & C, the append succeeds to both datanodes
   - end state is that:
     - datanodes A & C have live replicas with latest generation stamp 3
     - datanode B has a corrupt replica because it is lagging behind with generation stamp 1
   
   The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanodes in the pipeline, a corrupt block replica gets left over on the datanode that was shifted away from.
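
   For illustration, below is a minimal, hypothetical sketch of the generation-stamp comparison that makes the leftover replica corrupt; the class and method names are illustrative and are not the actual Datanode/Namenode API.

   ```java
   // Hypothetical sketch only: models the generation-stamp comparison that is
   // effectively applied when a replica is reported. Names are illustrative,
   // not the real BlockManager/DatanodeProtocol API.
   final class GenerationStampExample {

       // A reported replica is stale (and hence corrupt) if its generation
       // stamp is older than the one recorded for the block on the Namenode.
       static boolean isCorrupt(long reportedGenStamp, long recordedGenStamp) {
           return reportedGenStamp < recordedGenStamp;
       }

       public static void main(String[] args) {
           long latestGenStamp = 3; // generation stamp after the last pipeline shift
           System.out.println(isCorrupt(3, latestGenStamp)); // datanodes A & C: live
           System.out.println(isCorrupt(1, latestGenStamp)); // datanode B: corrupt
       }
   }
   ```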
   
   This corrupt block replica makes the datanode ineligible as a replication 
target for the block because of the following exception:
   
   ```
   2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
dst: /DN3:9866; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
   ```
   
   What typically occurs is that these corrupt block replicas are invalidated by the Namenode, which causes the corrupt replica to be deleted on the datanode, thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; this can take up to 6 hours with the default block report interval.
   
   As long as there is 1 live replica of the block, all the corrupt replicas 
should be invalidated based on this condition: 
https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
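
   As a deliberately simplified, self-contained model of that gate (the names below are illustrative and only approximate the real BlockManager logic):

   ```java
   // Deliberately simplified model of the gating condition described above.
   // Names are illustrative; this is not the actual HDFS BlockManager code.
   final class CorruptReplicaInvalidationModel {

       // Current behaviour: a reported corrupt replica may only be invalidated
       // once the block has at least minReplication LIVE replicas.
       static boolean mayInvalidate(int liveReplicas, int minReplication) {
           return liveReplicas >= minReplication;
       }

       public static void main(String[] args) {
           int minReplication = 1; // dfs.namenode.replication.min default
           System.out.println(mayInvalidate(1, minReplication)); // true: corrupt copies purged
           System.out.println(mayInvalidate(0, minReplication)); // false: corrupt copies kept
       }
   }
   ```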
   
   When there are 0 live replicas the corrupt replicas are not invalidated; I assume the reasoning behind this is that it is better to have some corrupt copies of the block than to have no copies at all.
   
   #### Description of Problem
   
   The problem comes into play when the aforementioned behavior is coupled with 
decommissioning and/or entering maintenance.
   
   Consider the following scenario:
   - block has replication factor of 2
   - there are 3 datanodes A, B, & C
   - datanode A has decommissioning replica
   - datanodes B & C have corrupt replicas
   
   This scenario is essentially a decommissioning & replication deadlock 
because:
   - corrupt replicas on B & C will not be invalidated because there are 0 live 
replicas (as per Namenode logic)
   - datanode A cannot finish decommissioning until the block is replicated to 
datanodes B & C
   - the block cannot be replicated to datanodes B & C until their corrupt 
replicas are invalidated
   
   This does not need to be a deadlock: the corrupt replicas could be invalidated and the live replica could then be transferred from A to B & C.
   
   The same problem can exist on a larger scale; the requirements are that:
   - liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
   - decommissioningAndEnteringMaintenanceReplicas > 0
   - liveReplicas + decommissioningAndEnteringMaintenanceReplicas + 
corruptReplicas = numberOfDatanodes
   
   In this case the corrupt replicas will not be invalidated by the Namenode 
which means that the decommissioning and entering maintenance replicas will 
never be sufficiently replicated and therefore will never finish 
decommissioning or entering maintenance.
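
   Expressed as a small, hypothetical predicate (not part of HDFS; parameter names are illustrative), the stuck state described above corresponds to:

   ```java
   // Hypothetical helper that simply restates the three requirements listed
   // above; it is not part of HDFS.
   final class DecommissionDeadlockCheck {

       static boolean isDeadlocked(int liveReplicas,
                                   int decomAndMaintenanceReplicas,
                                   int corruptReplicas,
                                   int numberOfDatanodes,
                                   int minReplication) {
           boolean corruptReplicasKept = liveReplicas < minReplication;
           boolean waitingOnReplication = decomAndMaintenanceReplicas > 0;
           boolean noHealthyTargetsLeft = liveReplicas + decomAndMaintenanceReplicas
                   + corruptReplicas == numberOfDatanodes;
           return corruptReplicasKept && waitingOnReplication && noHealthyTargetsLeft;
       }

       public static void main(String[] args) {
           // Scenario above: datanode A decommissioning, B & C corrupt, 3 datanodes.
           System.out.println(isDeadlocked(0, 1, 2, 3, 1)); // true
       }
   }
   ```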
   
   The symptom of this issue in the logs is that, right after a node with a corrupt replica sends its block report, rather than the replica being invalidated it just gets counted as a corrupt replica:
   
   ```
   TODO
   ```
   
   #### Proposed Solution
   
   Rather than evaluating minReplicationSatisfied based on live replicas alone, evaluate it based on usable replicas (i.e. live + decommissioning + enteringMaintenance). This allows the corrupt replicas to be invalidated and the live replica(s) on the decommissioning node(s) to be sufficiently replicated.
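
   Below is a minimal sketch of the proposed check under the same simplified model as above; the names are illustrative and not the actual BlockManager code.

   ```java
   // Sketch of the proposed behaviour under the same simplified model as above;
   // names are illustrative, not the actual BlockManager API.
   final class ProposedInvalidationModel {

       // Proposed: judge "minimum replication satisfied" on usable replicas
       // (live + decommissioning + enteringMaintenance) rather than live only.
       static boolean mayInvalidate(int liveReplicas, int decommissioningReplicas,
                                    int enteringMaintenanceReplicas,
                                    int minReplication) {
           int usableReplicas = liveReplicas + decommissioningReplicas
                   + enteringMaintenanceReplicas;
           return usableReplicas >= minReplication;
       }

       public static void main(String[] args) {
           // Deadlock scenario: 0 live replicas, 1 decommissioning replica.
           // The current live-only check returns false; the proposed check
           // returns true, so the corrupt replicas on B & C can be invalidated
           // and re-replication to them can proceed.
           System.out.println(mayInvalidate(0, 1, 0, 1)); // true
       }
   }
   ```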
   
   The only perceived risk here is that the corrupt replicas are invalidated at around the same time the decommissioning and entering maintenance nodes are taken out of service. This could in theory bring the overall number of replicas below the minReplicationFactor (to 0 in the worst case). This is, however, not an actual risk because the decommissioning and entering maintenance nodes will not finish decommissioning until their blocks have a sufficient number of live replicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.
   
   ### How was this patch tested?
   
   Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
   
   - TODO
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   

