[
https://issues.apache.org/jira/browse/HDFS-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18021018#comment-18021018
]
ASF GitHub Bot commented on HDFS-17658:
---------------------------------------
github-actions[bot] commented on PR #7179:
URL: https://github.com/apache/hadoop/pull/7179#issuecomment-3304941056
We're closing this stale PR because it has been open for 100 days with no
activity. This isn't a judgement on the merit of the PR in any way. It's just a
way of keeping the PR queue manageable.
If you feel like this was a mistake, or you would like to continue working
on it, please feel free to re-open it and ask for a committer to remove the
stale tag and review again.
Thanks all for your contribution.
> HDFS decommissioning does not consider if Under Construction blocks are
> sufficiently replicated which causes HDFS Data Loss
> ---------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-17658
> URL: https://issues.apache.org/jira/browse/HDFS-17658
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.4.0
> Reporter: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
>
> h2. Background
> The HDFS Namenode manages datanode decommissioning using the
> DatanodeAdminManager.
> The DatanodeAdminManager has logic to prevent transitioning datanodes to the
> decommissioned state if they contain open blocks (i.e. Under Construction
> blocks) which are not sufficiently replicated to other datanodes. In the
> context of decommissioning, open blocks matter because they cannot be
> replicated to other datanodes while they are actively being written by an
> HDFS client. Because open blocks cannot be moved during decommissioning,
> they should prevent the associated datanodes from becoming decommissioned
> until the block is completed/finalized and can be safely replicated to other
> live datanode(s).
>
> h2. Problem
> The logic for the DatanodeAdminManager to avoid decommissioning datanodes
> which contain open blocks does not properly consider blocks in the Under
> Construction state; it only considers blocks in the Committed/Finalized
> state. For reference:
> * a block which is actively being written by a DFSOutputStream is in the
> Under Construction state
> * only when the DFSOutputStream is closed does the block transition from the
> Under Construction state to the Committed/Finalized state
> * later, when the block is reported to the namenode in a block report, it
> transitions from the Committed state to the Completed state
> Furthermore:
> * this is true for a new file/block that was just created via
> "DFSClient.create"
> * this is true for an existing file/block that was just opened via
> "DFSClient.append"
> * this is true for all dfs.replication values (see the sketch below)
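> As a minimal sketch of this lifecycle (the path & buffer contents are
> illustrative; this assumes a reachable HDFS cluster), the state transitions
> can be driven from the client side like so:
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class UcBlockLifecycle {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path path = new Path("/tmp/uc-block-demo"); // illustrative path
>
>     // The block enters the Under Construction state once it is allocated.
>     FSDataOutputStream out = fs.create(path);
>     out.write(new byte[64 * 1024]);
>     out.hflush(); // data is durable on the pipeline, but the block stays UC
>
>     // Only close() transitions the block to Committed/Finalized; a later
>     // block report moves it from Committed to Completed on the namenode.
>     out.close();
>   }
> }
> {code}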
>
> h2. Impact
> In some environments, a datanode being decommissioned is taken as a signal
> that all the blocks on that datanode are sufficiently replicated to other
> live datanodes & therefore it is safe to terminate the underlying virtual
> host running the datanode.
> Such environments can suffer HDFS data loss for open blocks, because open
> blocks are not considered in the datanode decommissioning process: a
> datanode may become decommissioned before its open blocks are replicated to
> other datanodes. If the decommissioned datanode is then terminated, the open
> blocks are lost.
> For open blocks with replication of 1, it takes a single
> decommissioned/terminated datanode to cause data loss. For open blocks with
> greater replication, all the datanodes holding the open block must be
> decommissioned/terminated for there to be data loss.
> I would also break the impact down into 2 cases:
> * for a new HDFS block that has never been closed, arguably if the
> DFSOutputStream encounters a failure then the client should be able to
> replay the data from the source
> * for an existing HDFS block that has previously been closed, when that
> block is re-opened via a new DFSOutputStream it is susceptible to being
> lost & the client cannot replay data which was appended in the past by a
> different client. This is arguably the worse of the two cases.
>
> h2. Testing
> This behaviour has been verified via testing on Hadoop 3.4.0; however, I
> suspect it also applies to many other older/newer Hadoop versions.
> See JIRA comments for detailed test methodology & results.
> The following is a summary of the test cases & test results:
> {quote}*Test#1: Create Block & Repeatedly Append in Loop → Decommission
> Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not
> decommissioned until the write operation is finished & block is replicated to
> another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when
> the decommissioned datanode is terminated.
> *Test#2: Create Block & Repeatedly Append in Loop → Close DFSOutputStream →
> Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to
> another datanode as part of decommissioning.
> ^ Observation: block is considered during decommissioning & is replicated to
> another datanode as part of decommissioning.
> *Test#3: Create Block & Repeatedly Append in Loop → Close DFSOutputStream →
> Re-open Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; datanode is not
> decommissioned until the write operation is finished & block is replicated to
> another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when
> the decommissioned datanode is terminated.
> *Test#4: Create Block & Repeatedly Append in Loop → Close DFSOutputStream →
> Re-open Block & Repeatedly Append in Loop → Close DFSOutputStream →
> Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to
> another datanode as part of decommissioning.
> ^ Observation: block is not considered during decommissioning & is lost when
> the decommissioned datanode is terminated.
> {quote}
> These were all tested with replication factors of 1 & 2; the observation is
> the same in both cases.
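> A minimal client-side sketch of the Test#3 repro (the path is illustrative;
> decommissioning is initiated separately, e.g. via the exclude file & "hdfs
> dfsadmin -refreshNodes"):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class OpenBlockRepro {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path path = new Path("/repro/open-block"); // illustrative path
>
>     // Create the block & close it so it becomes Committed/Finalized.
>     try (FSDataOutputStream create = fs.create(path)) {
>       create.write(new byte[128 * 1024]);
>     }
>
>     // Re-open the block & keep appending; it is Under Construction again
>     // & not visible to the decommissioning checks described above.
>     FSDataOutputStream out = fs.append(path);
>     while (true) {
>       out.write(new byte[64 * 1024]);
>       out.hflush();
>       Thread.sleep(1000); // decommission the datanode while this loop runs
>     }
>   }
> }
> {code}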
>
> h2. Root Cause Theory
> The DatanodeAdminManager relies on the DatanodeDescriptor StorageInfos to
> identify which blocks to consider when determining whether or not a datanode
> can be decommissioned.
> Based on an examination of the HDFS Namenode & Datanode DEBUG logs during
> testing, I believe the root cause has to do with the following 2 behaviours:
> *1.* When a DFSOutputStream is created for a block, that block enters the
> Under Construction state, but the Namenode does not add Under Construction
> blocks to the StorageInfos; blocks are only added once committed/finalized,
> which occurs only when the DFSOutputStream is closed. Therefore, by checking
> only the StorageInfos, the DatanodeAdminManager is not actually checking
> Under Construction blocks.
> During the testing, we see that the Under Construction blocks are not in the
> StorageInfos for the corresponding DatanodeDescriptor:
> {quote}2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block
> RECEIVING_BLOCK: blk_1073741825_1001 is received from 172.31.93.123:9866
> ...
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting
> decommission of 172.31.93.123:9866
> [DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with
> 0 blocks
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting
> decommission of 172.31.93.123:9866
> [DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with
> 0 blocks
> {quote}
> Whereas, if the DFSOutputStream is closed, we see the block is included in
> the StorageInfos:
> {quote}2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* block
> RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.95.159:9866
> ...
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting
> decommission of 172.31.95.159:9866
> [DISK]DS-07f2256b-0cbc-4b5e-9c6c-9108650ee896:NORMAL:172.31.95.159:9866 with
> 0 blocks
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting
> decommission of 172.31.95.159:9866
> [DISK]DS-91919280-df53-4dac-b983-0eb12c81e4bf:NORMAL:172.31.95.159:9866 with
> 1 blocks
> 2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001,
> Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned
> replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live
> entering maintenance replicas: 0, replicas on stale nodes:0, readonly
> replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this
> block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current
> datanode decommissioning: true, Is current datanode entering maintenance:
> false
> {quote}
> I think this behaviour might be related to the following code which has been
> in Hadoop for the past 10+ years:
> [https://github.com/apache/hadoop/blob/51ebc3c20e8ae7d4dced41cdd2f52715aea604cc/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3708]
>
>
> *2.* When a DFSOutputStream is created for an existing block, the Namenode
> marks the existing block replica with the previous generation stamp as stale
> & removes it from the StorageInfos.
> We can see the following DEBUG logs during repro testing:
> {quote}2024-11-05 15:47:24,131 INFO namenode.FSNamesystem:
> updatePipeline(blk_1073741825_1001, newGS=1002, newLength=307200,
> newNodes=[172.31.95.208:9866], client=DFSClient_NONMAPREDUCE_-1408310900_1)
> 2024-11-05 15:47:24,132 DEBUG BlockStateChange: BLOCK* Removing stale replica
> ReplicaUC[[DISK]DS-1bd20290-4662-4e5e-af0b-9b644d78d4f8:NORMAL:172.31.95.208:9866|RBW]
> of blk_1073741825_1001
> 2024-11-05 15:47:24,132 INFO namenode.FSNamesystem:
> updatePipeline(blk_1073741825_1001 => blk_1073741825_1002) success
> {quote}
> Tracing the code from the logs, we see where this occurs:
> *
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4476]
> *
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4433]
> *
> [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlocksMap.java#L202]
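> A minimal sketch of triggering this behaviour from the client side (the
> path is illustrative; this assumes a file that was previously written &
> closed, & the behavioural comments are my reading of the logs above):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class AppendGenstampDemo {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>
>     // Re-opening an existing, closed file for append bumps the block's
>     // generation stamp (updatePipeline on the namenode); the replica with
>     // the old generation stamp is removed from the StorageInfos as "stale".
>     FSDataOutputStream out = fs.append(new Path("/repro/existing-file"));
>     out.write(new byte[64 * 1024]);
>     out.hflush();
>     // Until close() is called, the block is tracked only as Under
>     // Construction & is not visible to the DatanodeAdminManager's scan.
>   }
> }
> {code}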
>
>
> h3. Potential Solution
> If possible, consider adding Under Construction blocks to the StorageInfos.
> If that would have other adverse impacts, then an alternative is to expose
> the Under Construction blocks to the DatanodeAdminManager via a separate
> data structure, as sketched below.
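> A minimal sketch of the second option (all class & method names here are
> hypothetical, not existing Hadoop internals; the wiring into the write
> pipeline & the DatanodeAdminManager is elided):
> {code:java}
> import java.util.Collection;
> import java.util.Map;
> import java.util.Set;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Hypothetical tracker; names are illustrative only.
> class UnderConstructionBlockTracker {
>   // datanode UUID -> IDs of blocks currently under construction on it
>   private final Map<String, Set<Long>> ucBlocksByDatanode =
>       new ConcurrentHashMap<>();
>
>   // Called when a block enters Under Construction on a pipeline node.
>   void onPipelineAdded(String datanodeUuid, long blockId) {
>     ucBlocksByDatanode
>         .computeIfAbsent(datanodeUuid, k -> ConcurrentHashMap.newKeySet())
>         .add(blockId);
>   }
>
>   // Called when the block is committed/finalized on close().
>   void onBlockCommitted(long blockId, Collection<String> pipelineNodes) {
>     for (String uuid : pipelineNodes) {
>       Set<Long> blocks = ucBlocksByDatanode.get(uuid);
>       if (blocks != null) {
>         blocks.remove(blockId);
>       }
>     }
>   }
>
>   // The DatanodeAdminManager would consult this, in addition to the
>   // StorageInfos scan, before transitioning a node to DECOMMISSIONED.
>   boolean hasUnderConstructionBlocks(String datanodeUuid) {
>     Set<Long> blocks = ucBlocksByDatanode.get(datanodeUuid);
>     return blocks != null && !blocks.isEmpty();
>   }
> }
> {code}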
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]