[jira] [Work logged] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

ASF GitHub Bot (Jira) Mon, 06 Jun 2022 07:23:04 -0700


     [ 
https://issues.apache.org/jira/browse/HDFS-16064?focusedWorklogId=778639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778639
 ]


ASF GitHub Bot logged work on HDFS-16064:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Jun/22 14:22
            Start Date: 06/Jun/22 14:22
    Worklog Time Spent: 10m 
      Work Description: KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410

   HDFS-16064. Determine when to invalidate corrupt replicas based on number of 
usable replicas
   
   ### Description of PR
   
   Bug fix for a re-occurring HDFS bug which can result in datanodes being 
unable to complete decommissioning indefinitely. In short, the bug is a chicken 
& egg problem where:
   - in order for a datanode to be decommissioned its blocks must be 
sufficiently replicated
   - datanode cannot sufficiently replicate some block(s) because of corrupt 
block replicas on target datanodes
   - corrupt block replicas will not be invalidated because the block(s) are 
not sufficiently replicated
   
   In this scenario, the block(s) are sufficiently replicated but the logic the 
Namenode uses to determine if a block is sufficiently replicated is flawed.
   
   To understand the bug further we must first establish some background 
information.
   
   #### Background Information
   
   Givens:
   - FSDataOutputStream is being used to write the HDFS file, under the hood 
this uses a class DataStreamer
   - for the sake of example we will say the HDFS file has a replication factor 
of 2, though this is not a requirement to reproduce the issue
   - the file is being appended to intermittently over an extended period of 
time (in general, this issue needs minutes/hours  to reproduce)
   - HDFS is configured with typical default configurations
   
   Under certain scenarios the DataStreamer client can detect a bad link when 
trying to append to the block pipeline, in this case the DataStreamer client 
can shift the block pipeline by replacing the bad link with a new datanode. 
When this happens the replica on the datanode that was shifted away from 
becomes corrupted because it no longer has the latest generation stamp for the 
block. As a more concrete example:
   - DataStreamer client creates block pipeline on datanodes A & B, each have a 
block replica with generation stamp 1
   - DataStreamer client tries to append the block pipeline by sending block 
transfer (with generation stamp 2) to datanode A
   - Datanode A succeeds in writing the block transfer & then attempts to 
forward the transfer to datanode B
   - Datanode B fails the transfer for some reason and responds with a 
PipelineAck failure code
   - Datanode A sends a PipelineAck to DataStreamer indicating datanode A 
succeeded in the append & datanode B failed in the append. The DataStreamer 
detects datanode B as a bad link which will be replaced before the next append 
operation
   - at this point datanode A has live replica with generation stamp 2 & 
datanode B has corrupt replica with generation stamp 1
   - the next time DataStreamer tries to append the block it will call Namenode 
"getAdditionalDatanode" API which returns some other datanode C
   - DataStreamer sends data transfer (with generation stamp 3) to the new 
block pipeline containing datanodes A & C, the append succeeds to both datanodes
   - end state is that:
     - datanodes A & C have live replicas with latest generation stamp 3
     - datanode B has a corrupt replica because its lagging behind with 
generation stamp 1
   
   The key behavior being highlighted here is that when the DataStreamer client 
shifts the block pipeline due to append failures on a subset of the datanodes 
in the pipeline, a corrupt block replica gets leftover on the datanode that was 
shifted away from.
   
   This corrupt block replica makes the datanode ineligible as a replication 
target for the block because of the following exception:
   
   ```
   2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
dst: /DN3:9866; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
   ```
   
   What typically will occur is that these corrupt block replicas will be 
invalidated by the Namenode which will cause the corrupt replica to the be 
deleted on the datanode, thus allowing the datanode to once again be a 
replication target for the block. Note that the Namenode will not identify the 
corrupt block replica until the datanode sends its next block report, this can 
take up to 6 hours with the default block report interval.
   
   As long as there is 1 live replica of the block, all the corrupt replicas 
should be invalidated based on this condition: 
https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
   
   When there are 0 live replicas the corrupt replicas are not invalidated, I 
assume the reasoning behind this is that its better to have some corrupt copies 
of the block then to have no copies at all.
   
   #### Description of Problem
   
   The problem comes into play when the aforementioned behavior is coupled with 
decommissioning and/or entering maintenance.
   
   Consider the following scenario:
   - block has replication factor of 2
   - there are 3 datanodes A, B, & C
   - datanode A has decommissioning replica
   - datanodes B & C have corrupt replicas
   
   This scenario is essentially a decommissioning & replication deadlock 
because:
   - corrupt replicas on B & C will not be invalidated because there are 0 live 
replicas (as per Namenode logic)
   - datanode A cannot finish decommissioning until the block is replicated to 
datanodes B & C
   - the block cannot be replicated to datanodes B & C until their corrupt 
replicas are invalidated
   
   This does not need to be a deadlock, the corrupt replicas could be 
invalidated & the live replica could be transferred from A to B & C.
   
   The same problem can exist on a larger scale, the requirements are that:
   - liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
   - decommissioningAndEnteringMaintenanceReplicas > 0
   - liveReplicas + decommissioningAndEnteringMaintenanceReplicas + 
corruptReplicas = numberOfDatanodes
   
   In this case the corrupt replicas will not be invalidated by the Namenode 
which means that the decommissioning and entering maintenance replicas will 
never be sufficiently replicated and therefore will never finish 
decommissioning or entering maintenance.
   
   The symptom of this issue in the logs is that right after a node with a 
corrupt replica sends its block report, rather than the block being invalidated 
it just gets counted as a corrupt replica:
   
   ```
   TODO
   ```
   
   #### Proposed Solution
   
   Rather than checking if minReplicationSatisfied based on live replicas, 
check based on usable replicas (i.e. live + decommissioning + 
enteringMaintenance). This will allow the corrupt replicas to be invalidated & 
the live replica(s) on the decommissioning node(s) to be sufficiently 
replicated.
   
   The only perceived risk here would be that the corrupt blocks are 
invalidated at around the same time the decommissioning and entering 
maintenance nodes are decommissioned. This could in theory bring the overall 
number of replicas below the minReplicationFactor (to 0 in the worst case). 
This is however not an actual risk because the decommissioning and entering 
maintenance nodes will not finish decommissioning until they have a sufficient 
number of live replicas; so there is no possibility that the decommissioning 
and entering maintenance nodes will be decommissioned prematurely.
   
   ### How was this patch tested?
   
   Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
   
   - TODO
   
   ### For code changes:
   
   - [X] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the 
endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies 
licensed in a way that is compatible for inclusion under [ASF 
2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, 
`NOTICE-binary` files?
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 778639)
    Remaining Estimate: 0h
            Time Spent: 10m

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of 
> DataNode decommissioning
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state 
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available 
> datanodes for satisfying the minimum replication factor, then recover by 
> re-enabling the datanodes being decommissioned
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Work logged] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

Reply via email to