[jira] [Comment Edited] (HDFS-17658) HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss

Kevin Wikant (Jira) Tue, 05 Nov 2024 12:00:38 -0800


    [ 
https://issues.apache.org/jira/browse/HDFS-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895772#comment-17895772
 ]


Kevin Wikant edited comment on HDFS-17658 at 11/5/24 7:58 PM:
--------------------------------------------------------------

h2. Test#1: Create Block & Repeatedly Append -> Decommission Datanode 
(dfs.replication = 1)
 * Create a new HDFS cluster (with 0 blocks to begin with)
 * Run the test script to execute the desired test case

 
{quote}2024-11-05 14:32:16 Append Success  100 KB
2024-11-05 14:32:36 Append Success  200 KB
2024-11-05 14:32:56 Append Success  300 KB
2024-11-05 14:33:16 Append Success  400 KB
2024-11-05 14:33:36 Append Success  500 KB
2024-11-05 14:33:56 Append Success  600 KB
2024-11-05 14:34:16 Append Success  700 KB
2024-11-05 14:34:36 Append Success  800 KB
2024-11-05 14:34:56 Append Success  900 KB
2024-11-05 14:35:16 Append Success  1000 KB
2024-11-05 14:35:36 Append Success  1100 KB
2024-11-05 14:35:56 Append Success  1200 KB
2024-11-05 14:36:16 Append Success  1300 KB
2024-11-05 14:36:36 Append Success  1400 KB
2024-11-05 14:36:56 Append Success  1500 KB
2024-11-05 14:37:16 Append Success  1600 KB
2024-11-05 14:37:36 Append Success  1700 KB
2024-11-05 14:37:56 Append Success  1800 KB
2024-11-05 14:38:16 Append Success  1900 KB
2024-11-05 14:38:36 Append Success  2000 KB
2024-11-05 14:38:56 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:16 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:36 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:56 Append Failed   2000 KB
...
{quote} * Check the namenode logs to identify which datanode(s) have the block 
replica(s) 

 

 
{quote}2024-11-05 14:32:16,211 INFO hdfs.StateChange: BLOCK* allocate 
blk_1073741825_1001, replicas=172.31.93.123:9866 for 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd
{quote} * Start decommissioning all the datanode(s) with block replica(s)

 
 * Check the namenode logs to check if the block replica(s) are considered by 
the DatanodeAdminManager & are replicated to other datanode(s) before the 
containing datanode(s) become decommissioned

 
{quote}2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: 
Starting decommission of 172.31.93.123:9866 
[DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with 0 
blocks
2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting 
decommission of 172.31.93.123:9866 
[DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with 0 
blocks
2024-11-05 14:36:14,360 INFO blockmanagement.DatanodeAdminDefaultMonitor: 
Remove any stale pending Node same as added node 172.31.93.123:9866 with UUID 
7e4f2032-3e68-4ec7-92eb-ff3d08f0ea57 which is in state Decommission In Progress 
and isAlive is true
2024-11-05 14:36:14,362 INFO blockmanagement.DatanodeAdminManager: 
Decommissioning complete for node 172.31.93.123:9866
2024-11-05 14:36:14,362 DEBUG blockmanagement.DatanodeAdminDefaultMonitor: Node 
172.31.93.123:9866 is sufficiently replicated and healthy, marked as 
Decommissioned.
{quote} * Terminate the decommissioned datanode(s). Then validate if there was 
any HDFS data loss

 

HDFS file cannot be read because all block replicas are lost. Also, the HDFS 
write begins to fail with "{_}All datanodes are bad{_}"
h3. DEBUG Logs
 * blk_1073741825_1001 (create & do not close -> RECEIVING_BLOCK)

 
{quote}2024-11-05 14:32:16,211 DEBUG hdfs.StateChange: DIR* 
FSDirectory.addBlock: /tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd with 
blk_1073741825_1001 block is added to the in-memory file system
2024-11-05 14:32:16,211 INFO hdfs.StateChange: BLOCK* allocate 
blk_1073741825_1001, replicas=172.31.93.123:9866 for 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd
2024-11-05 14:32:16,211 DEBUG namenode.FSEditLog: logEdit [RpcEdit 
op:AddBlockOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, 
RpcCallId=-2] call:Call#1 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428
2024-11-05 14:32:16,211 DEBUG hdfs.StateChange: persistNewBlock: 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd with new block 
blk_1073741825_1001, current total block count is 1
2024-11-05 14:32:16,211 DEBUG namenode.FSEditLog: doEditTx() op=AddBlockOp 
[path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, 
RpcCallId=-2] txid=57
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: logEdit [RpcEdit 
op:UpdateBlocksOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428]
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: logSync [RpcEdit 
op:UpdateBlocksOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428]
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: doEditTx() op=UpdateBlocksOp 
[path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] txid=58
2024-11-05 14:32:19,206 DEBUG blockmanagement.BlockManager: Reported block 
blk_1073741825_1001 on 172.31.93.123:9866 size 268435456 replicaState = RBW
2024-11-05 14:32:19,206 DEBUG blockmanagement.BlockManager: Reported block 
blk_1073741825_1001 on 172.31.93.123:9866 size 268435456 replicaState = RBW
2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block RECEIVING_BLOCK: 
blk_1073741825_1001 is received from 172.31.93.123:9866
{quote}
 
 * blk_1073741825_1001 (decommissioning)
 ** no logs, open block not considered in decommissioning


was (Author: kevinwikant):
h2. Test#1: Create Block & Repeatedly Append -> Decommission Datanode 
(dfs.replication = 1)
 * Create a new HDFS cluster (with 0 blocks to begin with)
 * Run the test script to execute the desired test case

 
{quote}2024-11-05 14:32:16 Append Success  100 KB
2024-11-05 14:32:36 Append Success  200 KB
2024-11-05 14:32:56 Append Success  300 KB
2024-11-05 14:33:16 Append Success  400 KB
2024-11-05 14:33:36 Append Success  500 KB
2024-11-05 14:33:56 Append Success  600 KB
2024-11-05 14:34:16 Append Success  700 KB
2024-11-05 14:34:36 Append Success  800 KB
2024-11-05 14:34:56 Append Success  900 KB
2024-11-05 14:35:16 Append Success  1000 KB
2024-11-05 14:35:36 Append Success  1100 KB
2024-11-05 14:35:56 Append Success  1200 KB
2024-11-05 14:36:16 Append Success  1300 KB
2024-11-05 14:36:36 Append Success  1400 KB
2024-11-05 14:36:56 Append Success  1500 KB
2024-11-05 14:37:16 Append Success  1600 KB
2024-11-05 14:37:36 Append Success  1700 KB
2024-11-05 14:37:56 Append Success  1800 KB
2024-11-05 14:38:16 Append Success  1900 KB
2024-11-05 14:38:36 Append Success  2000 KB
2024-11-05 14:38:56 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:16 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:36 Append Failed   2000 KB
java.io.IOException: All datanodes 
[DatanodeInfoWithStorage[172.31.93.123:9866,DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1,DISK]]
 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DataStreamer.handleBadDatanode(DataStreamer.java:1609)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1543)
        at 
org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1529)
        at 
org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1305)
        at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:668)
2024-11-05 14:39:56 Append Failed   2000 KB
...
{quote}
* Check the namenode logs to identify which datanode(s) have the block 
replica(s) 

 

 
{quote}2024-11-05 14:32:16,211 INFO hdfs.StateChange: BLOCK* allocate 
blk_1073741825_1001, replicas=172.31.93.123:9866 for 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd
{quote}
* Start decommissioning all the datanode(s) with block replica(s)

 
 * Check the namenode logs to check if the block replica(s) are considered by 
the DatanodeAdminManager & are replicated to other datanode(s) before the 
containing datanode(s) become decommissioned

 
{quote}2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: 
Starting decommission of 172.31.93.123:9866 
[DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with 0 
blocks
2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting 
decommission of 172.31.93.123:9866 
[DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with 0 
blocks
2024-11-05 14:36:14,360 INFO blockmanagement.DatanodeAdminDefaultMonitor: 
Remove any stale pending Node same as added node 172.31.93.123:9866 with UUID 
7e4f2032-3e68-4ec7-92eb-ff3d08f0ea57 which is in state Decommission In Progress 
and isAlive is true
2024-11-05 14:36:14,362 INFO blockmanagement.DatanodeAdminManager: 
Decommissioning complete for node 172.31.93.123:9866
2024-11-05 14:36:14,362 DEBUG blockmanagement.DatanodeAdminDefaultMonitor: Node 
172.31.93.123:9866 is sufficiently replicated and healthy, marked as 
Decommissioned.
{quote}
* Terminate the decommissioned datanode(s). Then validate if there was any HDFS 
data loss

 

HDFS file cannot be read because all block replicas are lost. Also, the HDFS 
write begins to fail with "{_}All datanodes are bad{_}"
h3. DEBUG Logs
 * blk_1073741825_1001 (creation)

{quote}2024-11-05 14:32:16,211 DEBUG hdfs.StateChange: DIR* 
FSDirectory.addBlock: /tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd with 
blk_1073741825_1001 block is added to the in-memory file system
2024-11-05 14:32:16,211 INFO hdfs.StateChange: BLOCK* allocate 
blk_1073741825_1001, replicas=172.31.93.123:9866 for 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd
2024-11-05 14:32:16,211 DEBUG namenode.FSEditLog: logEdit [RpcEdit 
op:AddBlockOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, 
RpcCallId=-2] call:Call#1 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428
2024-11-05 14:32:16,211 DEBUG hdfs.StateChange: persistNewBlock: 
/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd with new block 
blk_1073741825_1001, current total block count is 1
2024-11-05 14:32:16,211 DEBUG namenode.FSEditLog: doEditTx() op=AddBlockOp 
[path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, 
RpcCallId=-2] txid=57
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: logEdit [RpcEdit 
op:UpdateBlocksOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428]
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: logSync [RpcEdit 
op:UpdateBlocksOp [path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 
org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from 
ip-172-31-88-29.ec2.internal:51428 / 172.31.88.29:51428]
2024-11-05 14:32:16,461 DEBUG namenode.FSEditLog: doEditTx() op=UpdateBlocksOp 
[path=/tmp/testfile-5f56dacf-b550-48ea-8f19-7a44bb0cb8dd, 
blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] txid=58
2024-11-05 14:32:19,206 DEBUG blockmanagement.BlockManager: Reported block 
blk_1073741825_1001 on 172.31.93.123:9866 size 268435456 replicaState = RBW
2024-11-05 14:32:19,206 DEBUG blockmanagement.BlockManager: Reported block 
blk_1073741825_1001 on 172.31.93.123:9866 size 268435456 replicaState = RBW
2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block RECEIVING_BLOCK: 
blk_1073741825_1001 is received from 172.31.93.123:9866
{quote} * blk_1073741825_1001 (decommissioning)
 ** no logs

> HDFS decommissioning does not consider if Under Construction blocks are 
> sufficiently replicated which causes HDFS Data Loss
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17658
>                 URL: https://issues.apache.org/jira/browse/HDFS-17658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.4.0
>            Reporter: Kevin Wikant
>            Priority: Major
>
> h2. Background
> The HDFS Namenode manages datanode decommissioning using the 
> DatanodeAdminManager
> The DatanodeAdminManager has logic to prevent transitioning datanodes to 
> decommissioned state if they contain open blocks (i.e. Under Construction 
> blocks) which are not sufficiently replicated to other datanodes. In the 
> context of decomissioning, the reason that open blocks are important is 
> because they cannot be replicated to other datanodes given they are actively 
> being appended by an HDFS client. Because open blocks cannot be moved during 
> decommissioning, they should prevent the associated datanodes from becoming 
> decommissioned until the block is completed/finalized and it can be safely 
> moved to other live datanode(s).
>  
> h2. Problem
> The logic for DatanodeAdminManager to avoid decommissioning datanodes which 
> contain open blocks does not properly consider blocks which are in Under 
> Construction state. The DatanodeAdminManager is only considering blocks which 
> are in committed/finalized state. For reference:
>  * a block which is actively being appended by a DFSOutputStream is in Under 
> Construction state
>  * only when a DFSOutputStream is closed does the block transition from Under 
> Construction state to Committed/Finalized state
>  * then later when the block is reported to the namenode in a block report, 
> it will transition from Committed to Completed state
> Furthermore:
>  * this is true for a new file/block that was just created via 
> "DFSClient.create"
>  * this is true for an existing file/block that was just opened via 
> "DFSClient.append"
>  * this is true for all different dfs.replication factor values
>  
> h2. Impact
> In some environments, a datanode being decommissioned is taken as a signal 
> that all the blocks on that datanodes are sufficiently replicated to other 
> live datanodes & therefore it is safe to terminate the underlying virtual 
> host running the datanode.
> These types of environments can be impacted by HDFS data loss for open blocks 
> which are not considered in the datanode decommissioning process. Since open 
> blocks are not considered in determining if a datanode can be decommissioned, 
> a datanode may become decommissioned before its open blocks are replicated to 
> other datanodes. If the decommissioned datanode is then terminated, the open 
> blocks will be lost.
> For open blocks with replication of 1, it takes a single 
> decommissioned/terminated datanode to cause data loss. For open blocks with 
> greater replication, all the datanodes holding the open block must be 
> decommissioned/terminated for there to be data loss.
> I would also break the impact down into 2 cases:
>  * for a new HDFS block that has never been closed, arguably if the 
> DFSOutputStream encounters failure then the client should be able to replay 
> the data from source
>  * for an existing HDFS block that has previously been closed, when this 
> block is opened via a new DFSOutputStream the block is susceptible to being 
> lost & the client cannot replay data which was appended in the past by a 
> different client. This is arguably the worse case of impact.
>  
> h2. Testing
> This behaviour has been verified via testing on Hadoop 3.4.0; however, I 
> suspect it also applies to many other older/newer Hadoop versions.
> See JIRA comments for detailed test methodology & results.
> The following is a summary of the test cases & test results:
> {quote}*Test#1: Create Block & Repeatedly Append in Loop → Decommission 
> Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not 
> decommissioned until the write operation is finished & block is replicated to 
> another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when 
> decommissioned data is terminated.
> *Test#2: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → 
> Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to 
> another datanode as part of decommissioning.
> ^ Observation: block is considered during decommissioning & is replicated to 
> another datanode as part of decommissioning.
> *Test#3: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → 
> Re-open Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not 
> decommissioned until the write operation is finished & block is replicated to 
> another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when 
> decommissioned data is terminated.
> *Test#4: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → 
> Re-open Block & Repeatedly Append in Loop → Close DFSOutputStream → 
> Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to 
> another datanode as part of decommissioning.
> ^ Observation: block is not considered during decommissioning & is lost when 
> decommissioned data is terminated.
> {quote}
> These were all tested with replication factor of 1 & 2, observation is the 
> same in both cases.
>  
> h2. Root Cause Theory
> The DatanodeAdminManager relies on the DatanodeDescriptor StorageInfos to 
> identify which blocks to consider as part of wether or not a datanode can be 
> decommissioned.
> Based on an examination of the HDFS Namenode & Datanode DEBUG logs during 
> testing, I believe the root cause has to do with the following 2 behaviours:
> {{*1.* When a DFSOutputStream is created for a block, that block enters Under 
> Construction state but the Namenode does not add Under Construction blocks to 
> the StorageInfos unless they have been committed/finalized which only occurs 
> when the DFSOutputStream is closed. Therefore, by checking the StorageInfos 
> only, the DatanodeAdminManager is not actually checking Under Construction 
> blocks.}}
> During the testing, we see that the Under Construction blocks are not in the 
> StorageInfos for the corresponding DatanodeDescriptor:
> {quote}2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block 
> RECEIVING_BLOCK: blk_1073741825_1001 is received from 172.31.93.123:9866
> ...
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting 
> decommission of 172.31.93.123:9866 
> [DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with 
> 0 blocks
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting 
> decommission of 172.31.93.123:9866 
> [DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with 
> 0 blocks
> {quote}
> Whereas, if the DFSOutputStream is closed, we see the block is included in 
> the StorageInfos:
> {quote}2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* block 
> RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.95.159:9866
> ...
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting 
> decommission of 172.31.95.159:9866 
> [DISK]DS-07f2256b-0cbc-4b5e-9c6c-9108650ee896:NORMAL:172.31.95.159:9866 with 
> 0 blocks
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting 
> decommission of 172.31.95.159:9866 
> [DISK]DS-91919280-df53-4dac-b983-0eb12c81e4bf:NORMAL:172.31.95.159:9866 with 
> 1 blocks
> 2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001, 
> Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned 
> replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live 
> entering maintenance replicas: 0, replicas on stale nodes:0, readonly 
> replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this 
> block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current 
> datanode decommissioning: true, Is current datanode entering maintenance: 
> false
> {quote}
> I think this behaviour might be related to the following code which has been 
> in Hadoop for the past 10+ years: 
> [https://github.com/apache/hadoop/blob/51ebc3c20e8ae7d4dced41cdd2f52715aea604cc/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3708]
>  
>  
> {{*2.* When a DFSOutputStream is created for an existing block, the Namenode 
> marks the existing block with previous generation stamp as stale & removes it 
> from the StorageInfos.}}
> We can see the following DEBUG logs during repro testing:
> {quote}2024-11-05 15:47:24,131 INFO namenode.FSNamesystem: 
> updatePipeline(blk_1073741825_1001, newGS=1002, newLength=307200, 
> newNodes=[172.31.95.208:9866], client=DFSClient_NONMAPREDUCE_-1408310900_1)
> 2024-11-05 15:47:24,132 DEBUG BlockStateChange: BLOCK* Removing stale replica 
> ReplicaUC[[DISK]DS-1bd20290-4662-4e5e-af0b-9b644d78d4f8:NORMAL:172.31.95.208:9866|RBW]
>  of blk_1073741825_1001
> 2024-11-05 15:47:24,132 INFO namenode.FSNamesystem: 
> updatePipeline(blk_1073741825_1001 => blk_1073741825_1002) success
> {quote}
> Tracing the code from the logs we see where this occurs:
>  * 
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4476]
>  * 
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4433]
>  * 
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlocksMap.java#L202]
>  
>  
> h3. Potential Solution
> If possible, can consider adding Under Construction blocks to StorageInfos. 
> If this would have other adverse impacts, then another solution is to expose 
> the Under Construction blocks to the DatanodeAdminManager via another 
> separate data structure.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

[jira] [Comment Edited] (HDFS-17658) HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss

Reply via email to