[
https://issues.apache.org/jira/browse/HDFS-16064?focusedWorklogId=778639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778639
]
ASF GitHub Bot logged work on HDFS-16064:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 06/Jun/22 14:22
Start Date: 06/Jun/22 14:22
Worklog Time Spent: 10m
Work Description: KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410
HDFS-16064. Determine when to invalidate corrupt replicas based on number of
usable replicas
### Description of PR
Bug fix for a re-occurring HDFS bug which can result in datanodes being
unable to complete decommissioning indefinitely. In short, the bug is a chicken
& egg problem where:
- in order for a datanode to be decommissioned its blocks must be
sufficiently replicated
- datanode cannot sufficiently replicate some block(s) because of corrupt
block replicas on target datanodes
- corrupt block replicas will not be invalidated because the block(s) are
not sufficiently replicated
In this scenario, the block(s) are sufficiently replicated but the logic the
Namenode uses to determine if a block is sufficiently replicated is flawed.
To understand the bug further we must first establish some background
information.
#### Background Information
Givens:
- FSDataOutputStream is being used to write the HDFS file, under the hood
this uses a class DataStreamer
- for the sake of example we will say the HDFS file has a replication factor
of 2, though this is not a requirement to reproduce the issue
- the file is being appended to intermittently over an extended period of
time (in general, this issue needs minutes/hours to reproduce)
- HDFS is configured with typical default configurations
Under certain scenarios the DataStreamer client can detect a bad link when
trying to append to the block pipeline, in this case the DataStreamer client
can shift the block pipeline by replacing the bad link with a new datanode.
When this happens the replica on the datanode that was shifted away from
becomes corrupted because it no longer has the latest generation stamp for the
block. As a more concrete example:
- DataStreamer client creates block pipeline on datanodes A & B, each have a
block replica with generation stamp 1
- DataStreamer client tries to append the block pipeline by sending block
transfer (with generation stamp 2) to datanode A
- Datanode A succeeds in writing the block transfer & then attempts to
forward the transfer to datanode B
- Datanode B fails the transfer for some reason and responds with a
PipelineAck failure code
- Datanode A sends a PipelineAck to DataStreamer indicating datanode A
succeeded in the append & datanode B failed in the append. The DataStreamer
detects datanode B as a bad link which will be replaced before the next append
operation
- at this point datanode A has live replica with generation stamp 2 &
datanode B has corrupt replica with generation stamp 1
- the next time DataStreamer tries to append the block it will call Namenode
"getAdditionalDatanode" API which returns some other datanode C
- DataStreamer sends data transfer (with generation stamp 3) to the new
block pipeline containing datanodes A & C, the append succeeds to both datanodes
- end state is that:
- datanodes A & C have live replicas with latest generation stamp 3
- datanode B has a corrupt replica because its lagging behind with
generation stamp 1
The key behavior being highlighted here is that when the DataStreamer client
shifts the block pipeline due to append failures on a subset of the datanodes
in the pipeline, a corrupt block replica gets leftover on the datanode that was
shifted away from.
This corrupt block replica makes the datanode ineligible as a replication
target for the block because of the following exception:
```
2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
(DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
dst: /DN3:9866;
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
```
What typically will occur is that these corrupt block replicas will be
invalidated by the Namenode which will cause the corrupt replica to the be
deleted on the datanode, thus allowing the datanode to once again be a
replication target for the block. Note that the Namenode will not identify the
corrupt block replica until the datanode sends its next block report, this can
take up to 6 hours with the default block report interval.
As long as there is 1 live replica of the block, all the corrupt replicas
should be invalidated based on this condition:
https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
When there are 0 live replicas the corrupt replicas are not invalidated, I
assume the reasoning behind this is that its better to have some corrupt copies
of the block then to have no copies at all.
#### Description of Problem
The problem comes into play when the aforementioned behavior is coupled with
decommissioning and/or entering maintenance.
Consider the following scenario:
- block has replication factor of 2
- there are 3 datanodes A, B, & C
- datanode A has decommissioning replica
- datanodes B & C have corrupt replicas
This scenario is essentially a decommissioning & replication deadlock
because:
- corrupt replicas on B & C will not be invalidated because there are 0 live
replicas (as per Namenode logic)
- datanode A cannot finish decommissioning until the block is replicated to
datanodes B & C
- the block cannot be replicated to datanodes B & C until their corrupt
replicas are invalidated
This does not need to be a deadlock, the corrupt replicas could be
invalidated & the live replica could be transferred from A to B & C.
The same problem can exist on a larger scale, the requirements are that:
- liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
- decommissioningAndEnteringMaintenanceReplicas > 0
- liveReplicas + decommissioningAndEnteringMaintenanceReplicas +
corruptReplicas = numberOfDatanodes
In this case the corrupt replicas will not be invalidated by the Namenode
which means that the decommissioning and entering maintenance replicas will
never be sufficiently replicated and therefore will never finish
decommissioning or entering maintenance.
The symptom of this issue in the logs is that right after a node with a
corrupt replica sends its block report, rather than the block being invalidated
it just gets counted as a corrupt replica:
```
TODO
```
#### Proposed Solution
Rather than checking if minReplicationSatisfied based on live replicas,
check based on usable replicas (i.e. live + decommissioning +
enteringMaintenance). This will allow the corrupt replicas to be invalidated &
the live replica(s) on the decommissioning node(s) to be sufficiently
replicated.
The only perceived risk here would be that the corrupt blocks are
invalidated at around the same time the decommissioning and entering
maintenance nodes are decommissioned. This could in theory bring the overall
number of replicas below the minReplicationFactor (to 0 in the worst case).
This is however not an actual risk because the decommissioning and entering
maintenance nodes will not finish decommissioning until they have a sufficient
number of live replicas; so there is no possibility that the decommissioning
and entering maintenance nodes will be decommissioned prematurely.
### How was this patch tested?
Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
- TODO
### For code changes:
- [X] Does the title or this PR starts with the corresponding JIRA issue id
(e.g. 'HADOOP-17799. Your PR title ...')?
- [n/a] Object storage: have the integration tests been executed and the
endpoint declared according to the connector-specific documentation?
- [n/a] If adding new dependencies to the code, are these dependencies
licensed in a way that is compatible for inclusion under [ASF
2.0](http://www.apache.org/legal/resolved.html#category-a)?
- [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`,
`NOTICE-binary` files?
Issue Time Tracking
-------------------
Worklog Id: (was: 778639)
Remaining Estimate: 0h
Time Spent: 10m
> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, namenode
> Affects Versions: 3.2.1
> Reporter: Kevin Wikant
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a
> non-issue under the assumption that if the namenode & a datanode get into an
> inconsistent state for a given block pipeline, there should be another
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
> * there are initially 4 datanodes DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in
> order to satisfy their minimum replication factor of 2
> * during this replication process
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes
> the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode
> (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]):
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654
> dst: /DN3:9866;
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum
> replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it
> because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly
> fails, this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum
> replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be
> completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0):
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0,
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 ,
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of
> DataNode decommissioning
> A few potential solutions:
> * Address the root cause of the problem which is an inconsistent state
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available
> datanodes for satisfying the minimum replication factor, then recover by
> re-enabling the datanodes being decommissioned
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]