[jira] [Commented] (HDFS-17050) Erasure coding: fix bug for invalidating duplicated block when two ec block at the same datanode but different storage.

ASF GitHub Bot (Jira) Tue, 20 Jun 2023 02:43:04 -0700


    [ https://issues.apache.org/jira/browse/HDFS-17050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735212#comment-17735212 ]


ASF GitHub Bot commented on HDFS-17050:
---------------------------------------

zhangshuyan0 commented on code in PR #5753:
URL: https://github.com/apache/hadoop/pull/5753#discussion_r1234996837


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoStriped.java:
##########
@@ -128,8 +128,25 @@ boolean addStorage(DatanodeStorageInfo storage, Block reportedBlock) {
     DatanodeStorageInfo old = getStorageInfo(index);
     if (old != null && !old.equals(storage)) { // over replicated
       // check if the storage has been stored
+      boolean blockIdNotEquals = false;
+      long blockGroupId = BlockIdManager.convertToStripedID(reportedBlock.getBlockId() - blockIndex);
+      Iterator<BlockInfo> blockIterator = old.getBlockIterator();
+      while (blockIterator.hasNext()) {
+        BlockInfo blockInfo = blockIterator.next();
+        if (!blockInfo.isStriped()) {
+          continue;
+        } else {
+          if (BlockIdManager.convertToStripedID(blockInfo.getBlockId()) == blockGroupId) {
+            Block blockOnOldStorage = ((BlockInfoStriped) blockInfo).getBlockOnStorage(old);
+            if (blockOnOldStorage.getBlockId() != reportedBlock.getBlockId()) {
+              blockIdNotEquals = true;

Review Comment:
   Sorry, I'm a bit confused here. We retrieve `old` using `index`, so how can 
`blockOnOldStorage.getBlockId()` and `reportedBlock.getBlockId()` be different?
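
The question above depends on how an internal block ID relates to its block group. A minimal sketch, assuming the low four bits of a striped block ID carry the index within the group (the class `StripedIdSketch` and its methods are hypothetical stand-ins, not the `BlockIdManager` API):

```java
// Sketch only: illustrates the ID layout assumed for striped (EC) blocks.
class StripedIdSketch {
  // Assumption: the low 4 bits of an internal block ID hold its index in the group.
  static final long INDEX_MASK = 15L;

  // Clears the index bits to recover the block group ID.
  static long toGroupId(long blockId) {
    return blockId & ~INDEX_MASK;
  }

  // Extracts the index of the internal block within its group.
  static int indexInGroup(long blockId) {
    return (int) (blockId & INDEX_MASK);
  }

  public static void main(String[] args) {
    // Block group ID taken from the log lines quoted in this issue.
    long groupId = -9223372036578646784L;
    long internalBlockId = groupId + 3; // internal block with index 3

    System.out.println(toGroupId(internalBlockId) == groupId); // true
    System.out.println(indexInGroup(internalBlockId));         // 3
  }
}
```

Under that assumption, two internal blocks of the same group with the same index necessarily share the same block ID, which is why a mismatch at this point looks surprising.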



##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java:
##########
@@ -195,11 +195,11 @@ public void run() {
 
         if (extraSleepTime > warnThresholdMs) {
           ++numGcWarnThresholdExceeded;
-          LOG.warn(formatMessage(
+          LOG.debug(formatMessage(

Review Comment:
   Why does this need to be changed to debug?



##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:
##########
@@ -4543,25 +4545,32 @@ public NumberReplicas countNodes(BlockInfo b) {
   NumberReplicas countNodes(BlockInfo b, boolean inStartupSafeMode) {
     NumberReplicas numberReplicas = new NumberReplicas();
     Collection<DatanodeDescriptor> nodesCorrupt = corruptReplicas.getNodes(b);
+    HashSet<DatanodeDescriptor> alreadyCorruptSet = null;
     if (b.isStriped()) {
+      alreadyCorruptSet = new HashSet<>();
       countReplicasForStripedBlock(numberReplicas, (BlockInfoStriped) b,
-          nodesCorrupt, inStartupSafeMode);
+          nodesCorrupt, inStartupSafeMode, alreadyCorruptSet);
     } else {
       for (DatanodeStorageInfo storage : blocksMap.getStorages(b)) {
         checkReplicaOnStorage(numberReplicas, b, storage, nodesCorrupt,
-            inStartupSafeMode);
+            inStartupSafeMode, alreadyCorruptSet);
       }
     }
     return numberReplicas;
   }
 
   private StoredReplicaState checkReplicaOnStorage(NumberReplicas counters,
       BlockInfo b, DatanodeStorageInfo storage,
-      Collection<DatanodeDescriptor> nodesCorrupt, boolean inStartupSafeMode) {
+      Collection<DatanodeDescriptor> nodesCorrupt, boolean inStartupSafeMode,
+      HashSet<DatanodeDescriptor> alreadyCorrupt) {
     final StoredReplicaState s;
     if (storage.getState() == State.NORMAL) {
       final DatanodeDescriptor node = storage.getDatanodeDescriptor();
-      if (nodesCorrupt != null && nodesCorrupt.contains(node)) {
+      if (nodesCorrupt != null && nodesCorrupt.contains(node) &&
+          (alreadyCorrupt == null || !alreadyCorrupt.contains(node))) {

Review Comment:
   What does this newly added condition mean? I think the current implementation of 
`corruptReplicasMap` cannot distinguish between different internal blocks on 
the same datanode.
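
To illustrate the concern, here is a simplified sketch of a corrupt-replica map keyed only by block group and datanode. All names (`CorruptMapSketch`, `markCorrupt`, `isCorrupt`) are hypothetical, and the real `corruptReplicasMap` may be structured differently:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch only: a corrupt-replica record with no notion of storage or internal block index.
class CorruptMapSketch {
  // Hypothetical stand-in: block group ID -> datanodes reported as holding a corrupt replica.
  private final Map<Long, Set<String>> corrupt = new HashMap<>();

  void markCorrupt(long groupId, String datanode) {
    corrupt.computeIfAbsent(groupId, k -> new HashSet<>()).add(datanode);
  }

  boolean isCorrupt(long groupId, String datanode) {
    return corrupt.getOrDefault(groupId, Collections.emptySet()).contains(datanode);
  }

  public static void main(String[] args) {
    CorruptMapSketch map = new CorruptMapSketch();
    long groupId = -9223372036578646784L;

    // Suppose only one internal block on datanode12 is corrupt.
    map.markCorrupt(groupId, "datanode12");

    // A second internal block of the same group on another storage of datanode12
    // cannot be queried separately: the lookup has no index or storage argument.
    System.out.println(map.isCorrupt(groupId, "datanode12")); // true for both internal blocks
  }
}
```

With a key of this shape, any per-storage or per-index decision has to be made outside the map, which appears to be what the extra `alreadyCorrupt` set in the diff attempts.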





> Erasure coding: fix bug for invalidating duplicated block when two ec block 
> at the same datanode but different storage.
> -----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-17050
>                 URL: https://issues.apache.org/jira/browse/HDFS-17050
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 3.4.0, 3.3.2
>            Reporter: farmmamba
>            Assignee: farmmamba
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: image-2023-06-16-15-32-05-123.png
>
>
> I found a strange phenomenon, also mentioned in HDFS-17047.
> When an FBR (full block report) is triggered manually or automatically, we see 
> warning logs like the following:
> {code:java}
> 2023-06-14 16:29:36,432 WARN BlockStateChange: BLOCK* addStoredBlock: block blk_-9223372036578646784_59354864 moved to storageType DISK on node datanode12:50010
> 2023-06-14 16:29:36,477 WARN BlockStateChange: BLOCK* addStoredBlock: block blk_-9223372036578646784_59354864 moved to storageType DISK on node datanode12:50010
> {code}
> The logs above print the same storedBlock twice. After digging into the logs, 
> I found that two blocks of the same block group exist, for reasons still 
> unknown, and that one of the two also exists on another datanode. However, 
> fsck did not print the duplicated replica information.
> Additional information: the file is 3MB+ and we use RS-6-3-1024K, so fsck 
> prints information for only seven blocks. In fact, we have eight blocks, one 
> of which is a duplicate.
>  
> The reason these logs are printed is as follows:
> In the BlockManager#addStoredBlock method, because a datanode holds two blocks 
> of the same block group, the AddBlockResult is REPLACED.
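
The seven-block figure in the description can be checked with a little arithmetic. A minimal sketch, assuming the file fits in a single block group and that data units with no cells are not materialized as internal blocks (the class and method names are hypothetical):

```java
// Sketch only: expected number of internal blocks for a small erasure-coded file.
class InternalBlockCountSketch {
  // Assumption: the file fits in one block group; empty data units produce no block.
  static int internalBlocks(long fileSize, int dataUnits, int parityUnits, long cellSize) {
    if (fileSize <= 0) {
      return 0;
    }
    long cells = (fileSize + cellSize - 1) / cellSize; // ceiling division
    int dataBlocks = (int) Math.min(dataUnits, cells);
    return dataBlocks + parityUnits;
  }

  public static void main(String[] args) {
    long cellSize = 1024L * 1024L;     // RS-6-3-1024K uses 1 MiB cells
    long fileSize = 3L * cellSize + 1; // "3MB+": just over three cells
    // 4 data blocks + 3 parity blocks
    System.out.println(internalBlocks(fileSize, 6, 3, cellSize)); // 7
  }
}
```

An eighth reported block for such a file is therefore one more than the layout accounts for, consistent with the duplicate described above.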



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

