[ https://issues.apache.org/jira/browse/HDFS-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17895778#comment-17895778 ]
Kevin Wikant edited comment on HDFS-17658 at 11/5/24 7:27 PM:
--------------------------------------------------------------

h2. Test#2: Create Block & Repeatedly Append -> Close DFSOutputStream -> Decommission Datanode (dfs.replication = 1)
* Create a new HDFS cluster (with 0 blocks to begin with)
* Run the test script to execute the desired test case (see the sketch after this list)
{quote}
2024-11-05 14:49:49 Append Success 100 KB
2024-11-05 14:49:49 Append Success 200 KB
2024-11-05 14:49:49 Append Success 300 KB
{quote}
* Check the namenode logs to identify which datanode(s) have the block replica(s)
{quote}
2024-11-05 14:49:49,346 INFO hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=172.31.95.159:9866 for /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af
{quote}
* Start decommissioning all the datanode(s) with block replica(s)
* Check the namenode logs to verify whether the block replica(s) are considered by the DatanodeAdminManager & are replicated to other datanode(s) before the containing datanode(s) become decommissioned
{quote}
2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-07f2256b-0cbc-4b5e-9c6c-9108650ee896:NORMAL:172.31.95.159:9866 with 0 blocks
2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-91919280-df53-4dac-b983-0eb12c81e4bf:NORMAL:172.31.95.159:9866 with 1 blocks
2024-11-05 14:52:48,229 INFO blockmanagement.DatanodeAdminDefaultMonitor: Remove any stale pending Node same as added node 172.31.95.159:9866 with UUID eedcda8a-144d-4905-b73f-71a1787076db which is in state Decommission In Progress and isAlive is true
2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001, Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes:0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2024-11-05 14:52:48,231 INFO blockmanagement.DatanodeAdminDefaultMonitor: Node 172.31.95.159:9866 still has 1 blocks to replicate before it is a candidate to finish Decommission In Progress.
2024-11-05 14:53:18,231 DEBUG blockmanagement.DatanodeAdminDefaultMonitor: Processing Decommission In Progress node 172.31.95.159:9866
2024-11-05 14:53:18,236 INFO blockmanagement.DatanodeAdminManager: Decommissioning complete for node 172.31.95.159:9866
2024-11-05 14:53:18,236 DEBUG blockmanagement.DatanodeAdminDefaultMonitor: Node 172.31.95.159:9866 is sufficiently replicated and healthy, marked as Decommissioned.
{quote}
* Terminate the decommissioned datanode(s), then validate whether there was any HDFS data loss

The HDFS file is still readable because the block was replicated to another live datanode _172.31.90.98_
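The exact test script is not attached to this comment; for reference, the following is a minimal sketch of the equivalent Test#2 client-side flow (the file name, chunk size and class name are illustrative assumptions, not the script that produced the logs above):

{code:java}
// Sketch of the Test#2 flow: create a file, append to it in a loop, then close
// the DFSOutputStream so the block is committed/finalized before decommissioning.
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendThenCloseTest {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/testfile-" + UUID.randomUUID());

    byte[] chunk = new byte[100 * 1024]; // 100 KB per append
    FSDataOutputStream out = fs.create(file, (short) 1); // dfs.replication = 1
    for (int i = 1; i <= 3; i++) {
      out.write(chunk);
      out.hsync(); // flush the appended bytes to the datanode before the next append
      System.out.println("Append Success " + (i * 100) + " KB");
    }
    out.close(); // key step in Test#2: the block leaves Under Construction state

    // Decommissioning is then triggered out of band, e.g. by adding the datanode
    // to the exclude file and running `hdfs dfsadmin -refreshNodes`.
    fs.close();
  }
}
{code}

Closing the stream is the difference from the failing test cases: once the block is committed/finalized it appears in the DatanodeDescriptor StorageInfos, so the DatanodeAdminManager replicates it before marking the node Decommissioned.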
h3. DEBUG Logs
* blk_1073741825_1001
{quote}
2024-11-05 14:49:49,346 DEBUG hdfs.StateChange: DIR* FSDirectory.addBlock: /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af with blk_1073741825_1001 block is added to the in-memory file system
2024-11-05 14:49:49,346 INFO hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=172.31.95.159:9866 for /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af
2024-11-05 14:49:49,346 DEBUG namenode.FSEditLog: logEdit [RpcEdit op:AddBlockOp [path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, RpcCallId=-2] call:Call#1 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.addBlock from ip-172-31-95-158.ec2.internal:55366 / 172.31.95.158:55366
2024-11-05 14:49:49,346 DEBUG hdfs.StateChange: persistNewBlock: /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af with new block blk_1073741825_1001, current total block count is 1
2024-11-05 14:49:49,346 DEBUG namenode.FSEditLog: doEditTx() op=AddBlockOp [path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, penultimateBlock=NULL, lastBlock=blk_1073741825_1001, RpcClientId=, RpcCallId=-2] txid=57
2024-11-05 14:49:49,522 DEBUG namenode.FSEditLog: logEdit [RpcEdit op:UpdateBlocksOp [path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from ip-172-31-95-158.ec2.internal:55366 / 172.31.95.158:55366]
2024-11-05 14:49:49,522 DEBUG namenode.FSEditLog: logSync [RpcEdit op:UpdateBlocksOp [path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] call:Call#3 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.fsync from ip-172-31-95-158.ec2.internal:55366 / 172.31.95.158:55366]
2024-11-05 14:49:49,522 DEBUG namenode.FSEditLog: doEditTx() op=UpdateBlocksOp [path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, blocks=[blk_1073741825_1001], RpcClientId=, RpcCallId=-2] txid=58
2024-11-05 14:49:49,556 INFO namenode.FSNamesystem: BLOCK* blk_1073741825_1001 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af
2024-11-05 14:49:49,562 DEBUG blockmanagement.BlockManager: Reported block blk_1073741825_1001 on 172.31.95.159:9866 size 307200 replicaState = FINALIZED
2024-11-05 14:49:49,562 DEBUG blockmanagement.BlockManager: Reported block blk_1073741825_1001 on 172.31.95.159:9866 size 307200 replicaState = FINALIZED
2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* addStoredBlock: 172.31.95.159:9866 is added to blk_1073741825_1001 (size=307200)
2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.95.159:9866
2024-11-05 14:49:49,961 DEBUG namenode.FSEditLog: logEdit [RpcEdit op:CloseOp [length=0, inodeId=0, path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, replication=1, mtime=1730818189961, atime=1730818189286, blockSize=268435456, blocks=[blk_1073741825_1001], permissions=root:hdfsadmingroup:rw-r--r--, aclEntries=null, clientName=null, clientMachine=null, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=-12345] call:Call#5 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from ip-172-31-95-158.ec2.internal:55366 / 172.31.95.158:55366
2024-11-05 14:49:49,961 DEBUG namenode.FSEditLog: doEditTx() op=CloseOp [length=0, inodeId=0, path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, replication=1, mtime=1730818189961, atime=1730818189286, blockSize=268435456, blocks=[blk_1073741825_1001], permissions=root:hdfsadmingroup:rw-r--r--, aclEntries=null, clientName=null, clientMachine=null, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=59] txid=59
2024-11-05 14:49:49,961 DEBUG namenode.FSEditLog: logSync [RpcEdit op:CloseOp [length=0, inodeId=0, path=/tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af, replication=1, mtime=1730818189961, atime=1730818189286, blockSize=268435456, blocks=[blk_1073741825_1001], permissions=root:hdfsadmingroup:rw-r--r--, aclEntries=null, clientName=null, clientMachine=null, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=59] call:Call#5 Retry#0 org.apache.hadoop.hdfs.protocol.ClientProtocol.complete from ip-172-31-95-158.ec2.internal:55366 / 172.31.95.158:55366
{quote}
* blk_1073741825_1002
{quote}
2024-11-05 14:52:48,231 DEBUG BlockStateChange: BLOCK* NameSystem.LowRedundancyBlock.add: blk_1073741825_1001 has only 0 replicas and need 1 replicas so is added to neededReconstructions at priority level 0
2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001, Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes:0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
2024-11-05 14:52:48,263 DEBUG blockmanagement.BlockReconstructionWork: Creating a ReplicationWork to reconstruct blk_1073741825_1001
2024-11-05 14:52:48,264 DEBUG BlockStateChange: BLOCK* block blk_1073741825_1001 is moved from neededReconstruction to pendingReconstruction
2024-11-05 14:52:48,264 DEBUG BlockStateChange: BLOCK* NameSystem.LowRedundancyBlock.remove: Removing block blk_1073741825_1001 from priority queue 0
2024-11-05 14:52:48,264 DEBUG BlockStateChange: BLOCK* ask [172.31.95.159:9866] to replicate blk_1073741825_1001 to datanode(s) 172.31.90.98:9866
2024-11-05 14:52:49,788 DEBUG blockmanagement.BlockManager: Removing pending reconstruction for blk_1073741825_1001
2024-11-05 14:52:49,788 DEBUG blockmanagement.BlockManager: Reported block blk_1073741825_1001 on 172.31.90.98:9866 size 307200 replicaState = FINALIZED
2024-11-05 14:52:49,788 DEBUG blockmanagement.BlockManager: Reported block blk_1073741825_1001 on 172.31.90.98:9866 size 307200 replicaState = FINALIZED
2024-11-05 14:52:49,788 DEBUG BlockStateChange: BLOCK* addStoredBlock: 172.31.90.98:9866 is added to blk_1073741825_1001 (size=307200)
2024-11-05 14:52:49,788 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.90.98:9866
2024-11-05 14:57:57,585 DEBUG blockmanagement.BlockManager: blocks = [blk_1073741825_1001]
{quote}
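To back up the last step of the test (validating that there was no data loss after terminating the decommissioned datanode), a small read-back check along the following lines can be used; the class name and argument handling are illustrative assumptions:

{code:java}
// Sketch: confirm the file is still readable and report which datanodes now hold its block.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyAfterDecommission {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]); // e.g. /tmp/testfile-587e1ad8-f8db-4b3a-b4a5-74f26dbda4af

    // Expect the replica to be reported on the surviving datanode (172.31.90.98 in this test)
    FileStatus status = fs.getFileStatus(file);
    for (BlockLocation location : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(location);
    }

    // Reading the whole file fails with a missing-block error if the replica was lost
    long bytesRead = 0;
    try (FSDataInputStream in = fs.open(file)) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) > 0) {
        bytesRead += n;
      }
    }
    System.out.println("Read " + bytesRead + " bytes without error");
    fs.close();
  }
}
{code}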
> HDFS decommissioning does not consider if Under Construction blocks are sufficiently replicated which causes HDFS Data Loss
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-17658
> URL: https://issues.apache.org/jira/browse/HDFS-17658
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.4.0
> Reporter: Kevin Wikant
> Priority: Major
>
> h2. Background
>
> The HDFS Namenode manages datanode decommissioning using the DatanodeAdminManager.
>
> The DatanodeAdminManager has logic to prevent transitioning datanodes to decommissioned state if they contain open blocks (i.e. Under Construction blocks) which are not sufficiently replicated to other datanodes. In the context of decommissioning, the reason that open blocks are important is that they cannot be replicated to other datanodes given they are actively being appended by an HDFS client. Because open blocks cannot be moved during decommissioning, they should prevent the associated datanodes from becoming decommissioned until the block is completed/finalized and can be safely moved to other live datanode(s).
>
> h2. Problem
>
> The logic for the DatanodeAdminManager to avoid decommissioning datanodes which contain open blocks does not properly consider blocks which are in Under Construction state. The DatanodeAdminManager only considers blocks which are in committed/finalized state. For reference:
> * a block which is actively being appended by a DFSOutputStream is in Under Construction state
> * only when the DFSOutputStream is closed does the block transition from Under Construction state to Committed/Finalized state
> * later, when the block is reported to the namenode in a block report, it transitions from Committed to Completed state
>
> Furthermore:
> * this is true for a new file/block that was just created via "DFSClient.create"
> * this is true for an existing file/block that was just opened via "DFSClient.append"
> * this is true for all dfs.replication factor values
>
> h2. Impact
>
> In some environments, a datanode becoming decommissioned is taken as a signal that all the blocks on that datanode are sufficiently replicated to other live datanodes & that it is therefore safe to terminate the underlying virtual host running the datanode.
>
> These types of environments can be impacted by HDFS data loss for open blocks which are not considered in the datanode decommissioning process. Since open blocks are not considered in determining if a datanode can be decommissioned, a datanode may become decommissioned before its open blocks are replicated to other datanodes. If the decommissioned datanode is then terminated, the open blocks are lost.
>
> For open blocks with replication of 1, a single decommissioned/terminated datanode is enough to cause data loss. For open blocks with greater replication, all the datanodes holding the open block must be decommissioned/terminated for there to be data loss.
>
> I would break the impact down into 2 cases:
> * for a new HDFS block that has never been closed, arguably if the DFSOutputStream encounters a failure then the client should be able to replay the data from the source
> * for an existing HDFS block that has previously been closed, when the block is re-opened via a new DFSOutputStream it is susceptible to being lost & the client cannot replay data which was appended in the past by a different client. This is arguably the worst case of impact (see the sketch below).
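> For illustration, a minimal sketch of that second case (re-opening a previously closed file for append and leaving the stream open); this assumes the public FileSystem API and is not the exact reproduction script:
>
> {code:java}
> // Sketch only: re-open an existing, previously closed file for append.
> // While the stream stays open the last block is back in Under Construction
> // state, so (per this issue) it is not visible to the DatanodeAdminManager.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class AppendWithoutClose {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path file = new Path("/tmp/existing-file"); // hypothetical path, written and closed earlier
>
>     FSDataOutputStream out = fs.append(file);   // block goes back to Under Construction
>     out.write(new byte[100 * 1024]);
>     out.hflush();                               // data reaches the datanode, but the stream stays open
>
>     // If the datanode holding the replica is decommissioned and terminated now,
>     // the appended data (and the previously closed data in the same block) can be
>     // lost, because the open block is never scheduled for replication.
>     System.in.read();                           // keep the block open while decommissioning runs
>   }
> }
> {code}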
> h2. Testing
>
> This behaviour has been verified via testing on Hadoop 3.4.0; however, I suspect it also applies to many other older/newer Hadoop versions.
>
> See JIRA comments for detailed test methodology & results.
>
> The following is a summary of the test cases & test results:
> {quote}
> *Test#1: Create Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not decommissioned until the write operation is finished & block is replicated to another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> *Test#2: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> ^ Observation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> *Test#3: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Re-open Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not decommissioned until the write operation is finished & block is replicated to another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> *Test#4: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Re-open Block & Repeatedly Append in Loop → Close DFSOutputStream → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> {quote}
> These were all tested with replication factors of 1 & 2; the observation is the same in both cases.
>
> h2. Root Cause Theory
>
> The DatanodeAdminManager relies on the DatanodeDescriptor StorageInfos to identify which blocks to consider as part of whether or not a datanode can be decommissioned.
>
> Based on an examination of the HDFS Namenode & Datanode DEBUG logs during testing, I believe the root cause has to do with the following 2 behaviours:
>
> {{*1.* When a DFSOutputStream is created for a block, that block enters Under Construction state but the Namenode does not add Under Construction blocks to the StorageInfos unless they have been committed/finalized, which only occurs when the DFSOutputStream is closed. Therefore, by checking the StorageInfos only, the DatanodeAdminManager is not actually checking Under Construction blocks.}}
>
> During the testing, we see that the Under Construction blocks are not in the StorageInfos for the corresponding DatanodeDescriptor:
> {quote}
> 2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block RECEIVING_BLOCK: blk_1073741825_1001 is received from 172.31.93.123:9866
> ...
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.93.123:9866 [DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with 0 blocks
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.93.123:9866 [DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with 0 blocks
> {quote}
> Whereas, if the DFSOutputStream is closed, we see the block is included in the StorageInfos:
> {quote}
> 2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.95.159:9866
> ...
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-07f2256b-0cbc-4b5e-9c6c-9108650ee896:NORMAL:172.31.95.159:9866 with 0 blocks
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-91919280-df53-4dac-b983-0eb12c81e4bf:NORMAL:172.31.95.159:9866 with 1 blocks
> 2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001, Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes:0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> {quote}
> I think this behaviour might be related to the following code, which has been in Hadoop for the past 10+ years:
> [https://github.com/apache/hadoop/blob/51ebc3c20e8ae7d4dced41cdd2f52715aea604cc/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3708]
>
> {{*2.* When a DFSOutputStream is created for an existing block, the Namenode marks the existing block with the previous generation stamp as stale & removes it from the StorageInfos.}}
>
> We can see the following DEBUG logs during repro testing:
> {quote}
> 2024-11-05 15:47:24,131 INFO namenode.FSNamesystem: updatePipeline(blk_1073741825_1001, newGS=1002, newLength=307200, newNodes=[172.31.95.208:9866], client=DFSClient_NONMAPREDUCE_-1408310900_1)
> 2024-11-05 15:47:24,132 DEBUG BlockStateChange: BLOCK* Removing stale replica ReplicaUC[[DISK]DS-1bd20290-4662-4e5e-af0b-9b644d78d4f8:NORMAL:172.31.95.208:9866|RBW] of blk_1073741825_1001
> 2024-11-05 15:47:24,132 INFO namenode.FSNamesystem: updatePipeline(blk_1073741825_1001 => blk_1073741825_1002) success
> {quote}
> Tracing the code from the logs, we see where this occurs:
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4476]
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4433]
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlocksMap.java#L202]
>
> h3. Potential Solution
>
> If possible, consider adding Under Construction blocks to the StorageInfos. If this would have other adverse impacts, then another solution is to expose the Under Construction blocks to the DatanodeAdminManager via a separate data structure.
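> As a rough illustration of the second option, a hypothetical side structure that tracks Under Construction replicas per datanode and that the DatanodeAdminManager could consult before marking a node Decommissioned; every name below is made up for the sketch and is not an existing HDFS API:
>
> {code:java}
> // Hypothetical sketch only; not existing HDFS code or APIs.
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> /** Tracks which datanodes currently hold replicas of Under Construction blocks. */
> public class UnderConstructionReplicaTracker {
>   // datanode UUID -> block IDs of open (Under Construction) replicas on that node
>   private final Map<String, Set<Long>> openReplicasByNode = new ConcurrentHashMap<>();
>
>   /** Called when a write pipeline is created or updated for a block on a datanode. */
>   public void addOpenReplica(String datanodeUuid, long blockId) {
>     openReplicasByNode
>         .computeIfAbsent(datanodeUuid, k -> ConcurrentHashMap.newKeySet())
>         .add(blockId);
>   }
>
>   /** Called when the block is committed/finalized (DFSOutputStream closed). */
>   public void removeOpenReplica(String datanodeUuid, long blockId) {
>     Set<Long> open = openReplicasByNode.get(datanodeUuid);
>     if (open != null) {
>       open.remove(blockId);
>     }
>   }
>
>   /**
>    * The DatanodeAdminManager could call this in addition to its existing
>    * StorageInfos scan: a node with open replicas is not yet a candidate to
>    * finish Decommission In Progress.
>    */
>   public boolean hasOpenReplicas(String datanodeUuid) {
>     Set<Long> open = openReplicasByNode.get(datanodeUuid);
>     return open != null && !open.isEmpty();
>   }
> }
> {code}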