KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410
HDFS-16064. Determine when to invalidate corrupt replicas based on number of
usable replicas
### Description of PR
Bug fix for a recurring HDFS bug that can leave datanodes unable to complete
decommissioning indefinitely. In short, the bug is a chicken-and-egg problem
where:
- in order for a datanode to be decommissioned its blocks must be
sufficiently replicated
- a datanode cannot sufficiently replicate some block(s) because of corrupt
block replicas on the target datanodes
- corrupt block replicas will not be invalidated because the block(s) are
not sufficiently replicated
In this scenario the block(s) do have enough usable replicas, but the logic the
Namenode uses to determine whether a block is sufficiently replicated is flawed.
To understand the bug further we must first establish some background
information.
#### Background Information
Givens:
- FSDataOutputStream is being used to write the HDFS file; under the hood this
uses the DataStreamer class
- for the sake of example we will say the HDFS file has a replication factor
of 2, though this is not a requirement to reproduce the issue
- the file is being appended to intermittently over an extended period of
time (in general, this issue needs minutes/hours to reproduce)
- HDFS is configured with typical default configurations
Under certain scenarios the DataStreamer client can detect a bad link when
trying to append to the block pipeline; in this case the DataStreamer client
can shift the block pipeline by replacing the bad link with a new datanode.
When this happens, the replica on the datanode that was shifted away from
becomes corrupt because it no longer has the latest generation stamp for the
block. As a more concrete example:
- DataStreamer client creates block pipeline on datanodes A & B, each have a
block replica with generation stamp 1
- DataStreamer client tries to append to the block by sending a block transfer
(with generation stamp 2) to datanode A
- Datanode A succeeds in writing the block transfer & then attempts to
forward the transfer to datanode B
- Datanode B fails the transfer for some reason and responds with a
PipelineAck failure code
- Datanode A sends a PipelineAck to DataStreamer indicating datanode A
succeeded in the append & datanode B failed in the append. The DataStreamer
detects datanode B as a bad link which will be replaced before the next append
operation
- at this point datanode A has a live replica with generation stamp 2 &
datanode B has a corrupt replica with generation stamp 1
- the next time the DataStreamer tries to append to the block, it calls the
Namenode "getAdditionalDatanode" API, which returns some other datanode C
- DataStreamer sends the data transfer (with generation stamp 3) to the new
block pipeline containing datanodes A & C; the append succeeds on both datanodes
- end state is that:
- datanodes A & C have live replicas with latest generation stamp 3
- datanode B has a corrupt replica because it is lagging behind with
generation stamp 1
The key behavior highlighted here is that when the DataStreamer client shifts
the block pipeline due to append failures on a subset of the datanodes in the
pipeline, a corrupt block replica is left over on the datanode that was
shifted away from.
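The sequence above can be sketched as a toy model (the class and method names here are illustrative only, not Hadoop's actual DataStreamer/BlockManager types): a replica becomes corrupt once its generation stamp lags the block's latest.

```java
import java.util.HashMap;
import java.util.Map;

public class PipelineShiftSketch {
    // A replica is corrupt once its generation stamp lags the block's latest.
    static boolean isCorrupt(long replicaGenStamp, long latestGenStamp) {
        return replicaGenStamp < latestGenStamp;
    }

    public static void main(String[] args) {
        Map<String, Long> replicaGs = new HashMap<>();
        long latestGs = 1;
        replicaGs.put("A", latestGs);  // initial pipeline: A & B at GS 1
        replicaGs.put("B", latestGs);

        latestGs = 2;                  // append: A succeeds, B fails
        replicaGs.put("A", latestGs);  // B stays behind at GS 1

        latestGs = 3;                  // pipeline shifted to A & C
        replicaGs.put("A", latestGs);
        replicaGs.put("C", latestGs);

        // A and C hold live replicas; B's replica (GS 1) is corrupt.
        for (Map.Entry<String, Long> e : replicaGs.entrySet()) {
            System.out.println(e.getKey() + " corrupt: "
                + isCorrupt(e.getValue(), latestGs));
        }
    }
}
```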
This corrupt block replica makes the datanode ineligible as a replication
target for the block because of the following exception:
```
2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
dst: /DN3:9866;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
```
What typically occurs is that these corrupt block replicas are invalidated by
the Namenode, which causes the corrupt replica to be deleted on the datanode,
thus allowing the datanode to once again be a replication target for the
block. Note that the Namenode will not identify the corrupt block replica
until the datanode sends its next block report; this can take up to 6 hours
with the default block report interval.
As long as there is at least 1 live replica of the block, all the corrupt
replicas should be invalidated based on this condition:
https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
When there are 0 live replicas, the corrupt replicas are not invalidated; I
assume the reasoning behind this is that it is better to have some corrupt
copies of the block than to have no copies at all.
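In simplified form, the current gate boils down to something like the following (a paraphrase of the linked BlockManager logic, not the exact Hadoop code):

```java
public class InvalidationCheckSketch {
    // Simplified paraphrase of the Namenode's current gate (not actual
    // BlockManager code): corrupt replicas may be invalidated only when the
    // live-replica count alone meets dfs.namenode.replication.min.
    static boolean mayInvalidateCorruptReplica(int liveReplicas,
                                               int minReplication) {
        return liveReplicas >= minReplication;
    }

    public static void main(String[] args) {
        int minReplication = 1; // default dfs.namenode.replication.min
        // One live replica: corrupt copies can be invalidated.
        System.out.println(mayInvalidateCorruptReplica(1, minReplication)); // true
        // Zero live replicas (e.g. only a decommissioning replica): blocked.
        System.out.println(mayInvalidateCorruptReplica(0, minReplication)); // false
    }
}
```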
#### Description of Problem
The problem comes into play when the aforementioned behavior is coupled with
decommissioning and/or entering maintenance.
Consider the following scenario:
- block has replication factor of 2
- there are 3 datanodes A, B, & C
- datanode A has decommissioning replica
- datanodes B & C have corrupt replicas
This scenario is essentially a decommissioning & replication deadlock
because:
- corrupt replicas on B & C will not be invalidated because there are 0 live
replicas (as per Namenode logic)
- datanode A cannot finish decommissioning until the block is replicated to
datanodes B & C
- the block cannot be replicated to datanodes B & C until their corrupt
replicas are invalidated
This does not need to be a deadlock: the corrupt replicas could be
invalidated & the live replica could be transferred from A to B & C.
The same problem can exist on a larger scale; the requirements are that:
- liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
- decommissioningAndEnteringMaintenanceReplicas > 0
- liveReplicas + decommissioningAndEnteringMaintenanceReplicas +
corruptReplicas = numberOfDatanodes
In this case the corrupt replicas will not be invalidated by the Namenode
which means that the decommissioning and entering maintenance replicas will
never be sufficiently replicated and therefore will never finish
decommissioning or entering maintenance.
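These conditions can be expressed as a small illustrative predicate (not Hadoop code; the names are paraphrases of the quantities above). Plugging in the A/B/C scenario shows it is deadlocked:

```java
public class DecomDeadlockSketch {
    // Illustrative check for the deadlock conditions described above:
    // too few live replicas, at least one decommissioning or
    // entering-maintenance replica, and every remaining datanode already
    // holds a corrupt replica, so there is nowhere left to replicate to.
    static boolean isDeadlocked(int live, int decomAndMaint, int corrupt,
                                int datanodes, int minReplication) {
        return live < minReplication
            && decomAndMaint > 0
            && live + decomAndMaint + corrupt == datanodes;
    }

    public static void main(String[] args) {
        // The A/B/C scenario: A decommissioning, B & C corrupt, 3 datanodes.
        System.out.println(isDeadlocked(0, 1, 2, 3, 1)); // true
    }
}
```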
The symptom of this issue in the logs is that, right after a node with a
corrupt replica sends its block report, rather than the replica being
invalidated it just gets counted as a corrupt replica:
```
TODO
```
#### Proposed Solution
Rather than checking whether minimum replication is satisfied based on live
replicas alone, check based on usable replicas (i.e. live + decommissioning +
enteringMaintenance). This allows the corrupt replicas to be invalidated & the
live replica(s) on the decommissioning node(s) to be sufficiently replicated.
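A minimal sketch of the proposed check (a paraphrase of the idea, not the actual patch to BlockManager):

```java
public class UsableReplicasSketch {
    // Sketch of the proposed gate: count usable replicas as
    // live + decommissioning + enteringMaintenance, and allow invalidation
    // of corrupt replicas when that total meets the minimum replication.
    static boolean mayInvalidateCorruptReplica(int live, int decommissioning,
                                               int enteringMaintenance,
                                               int minReplication) {
        int usable = live + decommissioning + enteringMaintenance;
        return usable >= minReplication;
    }

    public static void main(String[] args) {
        // The deadlocked A/B/C scenario: 0 live, 1 decommissioning replica.
        // The old live-only check blocks invalidation; this check allows it.
        System.out.println(mayInvalidateCorruptReplica(0, 1, 0, 1)); // true
    }
}
```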
The only perceived risk is that the corrupt blocks could be invalidated at
around the same time the decommissioning and entering-maintenance nodes are
taken out of service. This could in theory bring the overall number of
replicas below the minReplicationFactor (to 0 in the worst case). This is not
an actual risk, however, because the decommissioning and entering-maintenance
nodes will not finish decommissioning until the block has a sufficient number
of live replicas; so there is no possibility that those nodes will be
decommissioned prematurely.
### How was this patch tested?
Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
- TODO
### For code changes:
- [X] Does the title of this PR start with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [n/a] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [n/a] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]