[ https://issues.apache.org/jira/browse/HDFS-17658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17907146#comment-17907146 ]
ASF GitHub Bot commented on HDFS-17658:
---------------------------------------

KevinWikant commented on code in PR #7179:
URL: https://github.com/apache/hadoop/pull/7179#discussion_r1892907319


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/UnderConstructionBlocks.java:
##########
@@ -0,0 +1,331 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.hadoop.hdfs.server.blockmanagement;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hdfs.DFSConfigKeys;
+import org.apache.hadoop.hdfs.protocol.Block;
+import org.apache.hadoop.thirdparty.com.google.common.collect.Maps;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import java.time.Duration;
+import java.time.Instant;
+import java.util.Collection;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+
+/**
+ * The BlockManager will not add an Under Construction
+ * block to the DatanodeDescriptor StorageInfos until
+ * the block is fully committed and finalized.
+ * The UC block replicas are instead tracked here
+ * for the DatanodeAdminManager to use.
+ * Note that this is tracked in memory only; as such,
+ * some Under Construction blocks may be missed in
+ * scenarios where the Namenode is restarted.
+ **/
+public class UnderConstructionBlocks {
+  private static final Logger LOG =
+      LoggerFactory.getLogger(UnderConstructionBlocks.class);
+
+  // Amount of time to wait in between checking all block replicas.
+  private static final Duration LONG_UNDER_CONSTRUCTION_BLOCK_CHECK_INTERVAL
+      = Duration.ofMinutes(5);
+  // Amount of time to wait before logging each individual block replica
+  // as a warning.
+  private static final Duration LONG_UNDER_CONSTRUCTION_BLOCK_WARN_THRESHOLD
+      = Duration.ofHours(2);
+  private static final Duration LONG_UNDER_CONSTRUCTION_BLOCK_WARN_INTERVAL
+      = Duration.ofMinutes(30);
+
+  private final Map<Block, Set<BlockReplica>> replicasByBlockId =
+      Maps.newHashMap();
+  private final boolean enabled;
+  private int count = 0;
+  // DatanodeAdminMonitor invokes logWarningForLongUnderConstructionBlocks every 30 seconds.
+  // To reduce the number of times this method loops through the Under Construction blocks,
+  // the interval is limited by LONG_UNDER_CONSTRUCTION_BLOCK_CHECK_INTERVAL.
+  private Instant nextWarnLogCheck =
+      Instant.now().plus(LONG_UNDER_CONSTRUCTION_BLOCK_CHECK_INTERVAL);
+
+  static class BlockReplica {
+    private final Block block;
+    private final DatanodeDescriptor dn;
+    private final Instant firstReportedTime;
+    private Instant nextWarnLog;
+
+    BlockReplica(Block block,
+        DatanodeDescriptor dn) {
+      this.block = block;
+      this.dn = dn;
+      this.firstReportedTime = Instant.now();
+      this.nextWarnLog = firstReportedTime.plus(LONG_UNDER_CONSTRUCTION_BLOCK_WARN_THRESHOLD);
+    }
+
+    Block getBlock() {
+      return block;
+    }
+
+    DatanodeDescriptor getDatanode() {
+      return dn;
+    }
+
+    boolean shouldLogWarning() {
+      if (Instant.now().isBefore(nextWarnLog)) {
+        return false;
+      }
+      nextWarnLog = Instant.now().plus(LONG_UNDER_CONSTRUCTION_BLOCK_WARN_INTERVAL);
+      return true;
+    }
+
+    Duration getDurationSinceReporting() {
+      return Duration.between(firstReportedTime, Instant.now());
+    }
+
+    @Override
+    public String toString() {
+      return String.format("ReportedBlockInfo [block=%s, dn=%s]", block, dn);
+    }
+  }
+
+  UnderConstructionBlocks(Configuration conf) {
+    this.enabled = conf.getBoolean(
+        DFSConfigKeys.DFS_DECOMMISSION_TRACK_UNDER_CONSTRUCTION_BLOCKS,
+        DFSConfigKeys.DFS_DECOMMISSION_TRACK_UNDER_CONSTRUCTION_BLOCKS_DEFAULT);
+    if (enabled) {
+      LOG.info("Tracking Under Construction blocks for DatanodeAdminManager");
+    } else {
+      LOG.debug("DatanodeAdminManager will not track Under Construction blocks");
+    }
+  }
+
+  /**
+   * Remove an Under Construction block replica.
+   * This method is called when an Under Construction block replica
+   * transitions from UC state to states like: finalized/complete,
+   * corrupt, invalidated, and deleted.
+   */
+  void removeUcBlock(DatanodeDescriptor reportingNode, Block reportedBlock) {
+    if (!enabled) {
+      return;
+    }
+    if (reportingNode == null || reportedBlock == null) {
+      LOG.warn("Unexpected null input {} , {}", reportingNode, reportedBlock);
+      return;
+    }
+    try {
+      Set<BlockReplica> replicas;
+      if (BlockIdManager.isStripedBlockID(reportedBlock.getBlockId())) {
+        Block blkId = new Block(BlockIdManager.convertToStripedID(reportedBlock
+            .getBlockId()));
+        replicas = getBlockReplicas(blkId);
+      } else {
+        reportedBlock = new Block(reportedBlock);
+        replicas = getBlockReplicas(reportedBlock);
+      }
+      if (replicas.isEmpty()) {
+        replicasByBlockId.remove(reportedBlock);
+        LOG.debug("UC block {} not found on {}, total is [replicas={} / blocks={}]",
+            reportedBlock, reportingNode, count, replicasByBlockId.size());
+      } else {
+        removeUcBlockFromSet(reportingNode, reportedBlock, replicas);
+      }
+    } catch (Exception e) {
+      // Observed during testing that exceptions thrown here
+      // are caught & never logged.
+      LOG.error("Remove UnderConstruction block {} {} failed",
+          reportedBlock, reportingNode, e);
+    }
+  }
+
+  private void removeUcBlockFromSet(DatanodeDescriptor reportingNode,
+      Block reportedBlock,
+      Set<BlockReplica> storedReplicasForBlock) {
+    final List<BlockReplica> storedBlocks = storedReplicasForBlock.stream()
+        .filter(replica -> reportingNode.equals(replica.getDatanode())
+            && reportedBlock.getGenerationStamp() >= replica.getBlock().getGenerationStamp())
+        .collect(Collectors.toList());
+    storedReplicasForBlock.removeIf(replica -> reportingNode.equals(replica.getDatanode())
+        && reportedBlock.getGenerationStamp() >= replica.getBlock().getGenerationStamp());
+    if (storedReplicasForBlock.isEmpty()) {
+      replicasByBlockId.remove(reportedBlock);
+    }
+    final String storedBlockString = storedBlocks.stream()
+        .map(br -> br.getBlock().toString())
+        .collect(Collectors.joining(","));
+    if (storedBlocks.size() > 1) {
+      LOG.warn("Removed multiple UC block [{}->{}] from {}, total is [replicas={} / blocks={}]",

Review Comment:
   +1, I will add a detailed comment:
   ```
   // Multiple replicas were removed from the reportingNode. This should never occur
   // because each UC replica should only have one copy stored in "replicasByBlockId".
   // Log a warning for this unexpected case.
   ```


> HDFS decommissioning does not consider if Under Construction blocks are
> sufficiently replicated which causes HDFS Data Loss
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17658
>                 URL: https://issues.apache.org/jira/browse/HDFS-17658
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.4.0
>            Reporter: Kevin Wikant
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Background
> The HDFS Namenode manages datanode decommissioning using the DatanodeAdminManager.
> The DatanodeAdminManager has logic to prevent transitioning datanodes to the decommissioned state if they contain open blocks (i.e. Under Construction blocks) which are not sufficiently replicated to other datanodes. In the context of decommissioning, open blocks matter because they cannot be replicated to other datanodes while they are actively being appended to by an HDFS client. Because open blocks cannot be moved during decommissioning, they should prevent the associated datanodes from becoming decommissioned until the block is completed/finalized and can be safely moved to other live datanode(s).
>
> h2. Problem
> The logic in the DatanodeAdminManager that avoids decommissioning datanodes containing open blocks does not properly consider blocks in the Under Construction state; it only considers blocks in the committed/finalized state. For reference:
> * a block which is actively being appended to by a DFSOutputStream is in the Under Construction state
> * only when the DFSOutputStream is closed does the block transition from the Under Construction state to the Committed/Finalized state
> * later, when the block is reported to the namenode in a block report, it transitions from the Committed to the Completed state
> Furthermore:
> * this is true for a new file/block that was just created via "DFSClient.create"
> * this is true for an existing file/block that was just opened via "DFSClient.append"
> * this is true for all dfs.replication factor values
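To make the state transitions above concrete, here is a minimal client-side sketch (my own illustration, not code from the PR; the path and buffer sizes are arbitrary) showing where a block sits in the Under Construction state:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenBlockExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/uc-block-example"); // illustrative path

    // The block enters the Under Construction state as soon as the stream is created.
    FSDataOutputStream out = fs.create(path, (short) 1);
    out.write(new byte[64 * 1024]);
    // hflush makes the data visible to readers, but the block stays Under
    // Construction; per this issue, it is not added to the DatanodeDescriptor
    // StorageInfos, so the DatanodeAdminManager cannot see it.
    out.hflush();

    // Only close() commits/finalizes the block; after the next block report
    // it becomes Completed and appears in the StorageInfos.
    out.close();

    // Re-opening for append puts the last block back into the Under
    // Construction state, reproducing the same decommissioning blind spot.
    FSDataOutputStream appendOut = fs.append(path);
    appendOut.write(new byte[1024]);
    appendOut.hflush();
    // While this stream stays open, decommissioning and terminating the
    // datanode(s) holding the replica can lose the block.
  }
}
```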
> h2. Impact
> In some environments, a datanode being decommissioned is taken as a signal that all the blocks on that datanode are sufficiently replicated to other live datanodes & that it is therefore safe to terminate the underlying virtual host running the datanode.
> These types of environments can suffer HDFS data loss for open blocks, which are not considered in the datanode decommissioning process. Since open blocks are not considered in determining if a datanode can be decommissioned, a datanode may become decommissioned before its open blocks are replicated to other datanodes. If the decommissioned datanode is then terminated, the open blocks are lost.
> For open blocks with replication of 1, a single decommissioned/terminated datanode is enough to cause data loss. For open blocks with greater replication, all the datanodes holding the open block must be decommissioned/terminated for there to be data loss.
> I would also break the impact down into 2 cases:
> * for a new HDFS block that has never been closed, arguably if the DFSOutputStream encounters a failure then the client should be able to replay the data from the source
> * for an existing HDFS block that has previously been closed, when the block is re-opened via a new DFSOutputStream it is susceptible to being lost, and the client cannot replay data which was appended in the past by a different client. This is arguably the worse of the two cases.
>
> h2. Testing
> This behaviour has been verified via testing on Hadoop 3.4.0; however, I suspect it also applies to many other older/newer Hadoop versions.
> See the JIRA comments for the detailed test methodology & results.
> The following is a summary of the test cases & test results:
> {quote}*Test#1: Create Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; the datanode is not decommissioned until the write operation is finished & the block is replicated to another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> *Test#2: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> ^ Observation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> *Test#3: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Re-open Block & Repeatedly Append in Loop → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning; the datanode is not decommissioned until the write operation is finished & the block is replicated to another datanode.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> *Test#4: Create Block & Repeatedly Append in Loop → Close DFSOutputStream → Re-open Block & Repeatedly Append in Loop → Close DFSOutputStream → Decommission Datanode*
> ^ Expectation: block is considered during decommissioning & is replicated to another datanode as part of decommissioning.
> ^ Observation: block is not considered during decommissioning & is lost when the decommissioned datanode is terminated.
> {quote}
> These were all tested with replication factors of 1 & 2; the observation is the same in both cases.
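The exact test harness is not included in this thread; the following is only a rough MiniDFSCluster-based sketch of the shape of Test#1, assuming direct use of the internal DatanodeAdminManager#startDecommission API (the original testing may well have used exclude files plus refreshNodes instead):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager;

public class Test1Sketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf)
        .numDataNodes(2).build();
    try {
      cluster.waitActive();
      DistributedFileSystem fs = cluster.getFileSystem();

      // Create a replication-1 file and keep appending without closing,
      // so the block never leaves the Under Construction state.
      FSDataOutputStream out = fs.create(new Path("/test-file"), (short) 1);
      for (int i = 0; i < 100; i++) {
        out.write(new byte[1024]);
        out.hflush();
      }

      // Decommission the datanode holding the open replica while the
      // stream is still open.
      DatanodeManager dnManager = cluster.getNamesystem()
          .getBlockManager().getDatanodeManager();
      DatanodeDescriptor dn = dnManager.getDatanode(
          cluster.getDataNodes().get(0).getDatanodeId());
      dnManager.getDatanodeAdminManager().startDecommission(dn);

      // Expectation: decommissioning should block until the open replica is
      // safe elsewhere. Observed (pre-fix): the node decommissions anyway,
      // and terminating it loses the block.
    } finally {
      cluster.shutdown();
    }
  }
}
```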
>
> h2. Root Cause Theory
> The DatanodeAdminManager relies on the DatanodeDescriptor StorageInfos to identify which blocks to consider as part of whether or not a datanode can be decommissioned.
> Based on an examination of the HDFS Namenode & Datanode DEBUG logs during testing, I believe the root cause has to do with the following 2 behaviours:
> {{*1.* When a DFSOutputStream is created for a block, that block enters the Under Construction state, but the Namenode does not add Under Construction blocks to the StorageInfos unless they have been committed/finalized, which only occurs when the DFSOutputStream is closed. Therefore, by checking the StorageInfos only, the DatanodeAdminManager is not actually checking Under Construction blocks.}}
> During the testing, we see that the Under Construction blocks are not in the StorageInfos for the corresponding DatanodeDescriptor:
> {quote}2024-11-05 14:32:19,206 DEBUG BlockStateChange: BLOCK* block RECEIVING_BLOCK: blk_1073741825_1001 is received from 172.31.93.123:9866
> ...
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.93.123:9866 [DISK]DS-bb1d316c-a47a-4a3a-bf6e-0781945a50d1:NORMAL:172.31.93.123:9866 with 0 blocks
> 2024-11-05 14:36:02,805 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.93.123:9866 [DISK]DS-95cfd2fa-25b0-4b20-aacf-e259201cf2eb:NORMAL:172.31.93.123:9866 with 0 blocks
> {quote}
> Whereas, if the DFSOutputStream is closed, we see the block is included in the StorageInfos:
> {quote}2024-11-05 14:49:49,563 DEBUG BlockStateChange: BLOCK* block RECEIVED_BLOCK: blk_1073741825_1001 is received from 172.31.95.159:9866
> ...
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-07f2256b-0cbc-4b5e-9c6c-9108650ee896:NORMAL:172.31.95.159:9866 with 0 blocks
> 2024-11-05 14:52:36,770 INFO blockmanagement.DatanodeAdminManager: Starting decommission of 172.31.95.159:9866 [DISK]DS-91919280-df53-4dac-b983-0eb12c81e4bf:NORMAL:172.31.95.159:9866 with 1 blocks
> 2024-11-05 14:52:48,231 INFO BlockStateChange: Block: blk_1073741825_1001, Expected Replicas: 1, live replicas: 0, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, maintenance replicas: 0, live entering maintenance replicas: 0, replicas on stale nodes: 0, readonly replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: 172.31.95.159:9866 , Current Datanode: 172.31.95.159:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> {quote}
> I think this behaviour might be related to the following code, which has been in Hadoop for the past 10+ years:
> [https://github.com/apache/hadoop/blob/51ebc3c20e8ae7d4dced41cdd2f52715aea604cc/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3708]
>
> {{*2.* When a DFSOutputStream is created for an existing block, the Namenode marks the existing block with the previous generation stamp as stale & removes it from the StorageInfos.}}
> We can see the following DEBUG logs during repro testing:
> {quote}2024-11-05 15:47:24,131 INFO namenode.FSNamesystem: updatePipeline(blk_1073741825_1001, newGS=1002, newLength=307200, newNodes=[172.31.95.208:9866], client=DFSClient_NONMAPREDUCE_-1408310900_1)
> 2024-11-05 15:47:24,132 DEBUG BlockStateChange: BLOCK* Removing stale replica ReplicaUC[[DISK]DS-1bd20290-4662-4e5e-af0b-9b644d78d4f8:NORMAL:172.31.95.208:9866|RBW] of blk_1073741825_1001
> 2024-11-05 15:47:24,132 INFO namenode.FSNamesystem: updatePipeline(blk_1073741825_1001 => blk_1073741825_1002) success
> {quote}
> Tracing the code from the logs, we can see where this occurs:
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4476]
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L4433]
> * [https://github.com/apache/hadoop/blob/f7651e2f63ddba9ed1ae4052e38464f85dd445f0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlocksMap.java#L202]
>
> h3. Potential Solution
> If possible, one option is to add Under Construction blocks to the StorageInfos.
> If this would have other adverse impacts, then another solution is to expose the Under Construction blocks to the DatanodeAdminManager via a separate data structure.
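The second option is what the PR's UnderConstructionBlocks class provides. For illustration only, a hedged sketch of how a decommissioning check could consult such a separate tracker; every name below apart from Block and DatanodeDescriptor is hypothetical, not the actual PR API:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.hdfs.protocol.Block;
import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;

class DecommissionUcCheckSketch {
  // Hypothetical stand-in for the PR's UnderConstructionBlocks tracker.
  interface UcTracker {
    Map<DatanodeDescriptor, List<Block>> getUcBlocksByDatanode();
  }

  /** Returns true only when the node has no open replicas left to wait on. */
  static boolean safeToDecommission(UcTracker tracker, DatanodeDescriptor dn) {
    List<Block> openReplicas = tracker.getUcBlocksByDatanode()
        .getOrDefault(dn, Collections.emptyList());
    // Open replicas are invisible to the StorageInfos scan, so they must be
    // checked separately before the node transitions to DECOMMISSIONED.
    return openReplicas.isEmpty();
  }
}
```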
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org