[
https://issues.apache.org/jira/browse/HDFS-17050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17735212#comment-17735212
]
ASF GitHub Bot commented on HDFS-17050:
---------------------------------------
zhangshuyan0 commented on code in PR #5753:
URL: https://github.com/apache/hadoop/pull/5753#discussion_r1234996837
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockInfoStriped.java:
##########
@@ -128,8 +128,25 @@ boolean addStorage(DatanodeStorageInfo storage, Block
reportedBlock) {
DatanodeStorageInfo old = getStorageInfo(index);
if (old != null && !old.equals(storage)) { // over replicated
// check if the storage has been stored
+ boolean blockIdNotEquals = false;
+ long blockGroupId =
BlockIdManager.convertToStripedID(reportedBlock.getBlockId() - blockIndex);
+ Iterator<BlockInfo> blockIterator = old.getBlockIterator();
+ while (blockIterator.hasNext()) {
+ BlockInfo blockInfo = blockIterator.next();
+ if (!blockInfo.isStriped()) {
+ continue;
+ } else {
+ if (BlockIdManager.convertToStripedID(blockInfo.getBlockId()) ==
blockGroupId) {
+ Block blockOnOldStorage = ((BlockInfoStriped)
blockInfo).getBlockOnStorage(old);
+ if (blockOnOldStorage.getBlockId() != reportedBlock.getBlockId()) {
+ blockIdNotEquals = true;
Review Comment:
Sorry, I'm a bit confused here. We retrieve `old` using `index`, How can
`blockOnOldStorage.getBlockId()` and `reportedBlock.getBlockId()` be different?
##########
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java:
##########
@@ -195,11 +195,11 @@ public void run() {
if (extraSleepTime > warnThresholdMs) {
++numGcWarnThresholdExceeded;
- LOG.warn(formatMessage(
+ LOG.debug(formatMessage(
Review Comment:
Why does this need to be changed to debug?
##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:
##########
@@ -4543,25 +4545,32 @@ public NumberReplicas countNodes(BlockInfo b) {
NumberReplicas countNodes(BlockInfo b, boolean inStartupSafeMode) {
NumberReplicas numberReplicas = new NumberReplicas();
Collection<DatanodeDescriptor> nodesCorrupt = corruptReplicas.getNodes(b);
+ HashSet<DatanodeDescriptor> alreadyCorruptSet = null;
if (b.isStriped()) {
+ alreadyCorruptSet = new HashSet<>();
countReplicasForStripedBlock(numberReplicas, (BlockInfoStriped) b,
- nodesCorrupt, inStartupSafeMode);
+ nodesCorrupt, inStartupSafeMode, alreadyCorruptSet);
} else {
for (DatanodeStorageInfo storage : blocksMap.getStorages(b)) {
checkReplicaOnStorage(numberReplicas, b, storage, nodesCorrupt,
- inStartupSafeMode);
+ inStartupSafeMode, alreadyCorruptSet);
}
}
return numberReplicas;
}
private StoredReplicaState checkReplicaOnStorage(NumberReplicas counters,
BlockInfo b, DatanodeStorageInfo storage,
- Collection<DatanodeDescriptor> nodesCorrupt, boolean inStartupSafeMode) {
+ Collection<DatanodeDescriptor> nodesCorrupt, boolean inStartupSafeMode,
+ HashSet<DatanodeDescriptor> alreadyCorrupt) {
final StoredReplicaState s;
if (storage.getState() == State.NORMAL) {
final DatanodeDescriptor node = storage.getDatanodeDescriptor();
- if (nodesCorrupt != null && nodesCorrupt.contains(node)) {
+ if (nodesCorrupt != null && nodesCorrupt.contains(node) &&
+ (alreadyCorrupt == null || !alreadyCorrupt.contains(node))) {
Review Comment:
What does this new added condition mean? I think the current implement of
`corruptReplicasMap` can not distinguish between different internal blocks on
the same datanode.
> Erasure coding: fix bug for invalidating duplicated block when two ec block
> at the same datanode but different storage.
> -----------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-17050
> URL: https://issues.apache.org/jira/browse/HDFS-17050
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.4.0, 3.3.2
> Reporter: farmmamba
> Assignee: farmmamba
> Priority: Major
> Labels: pull-request-available
> Attachments: image-2023-06-16-15-32-05-123.png
>
>
> Currently, I found a strange phenomenon mentioned in HDFS-17047.
> When triggered FBR manually or automatically, we found some warn log like
> below:
> {code:java}
> 2023-06-14 16:29:36,432 WARN BlockStateChange: BLOCK* addStoredBlock: block
> blk_-9223372036578646784_59354864 moved to storageType DISK on node
> datanode12:50010
> 2023-06-14 16:29:36,477 WARN BlockStateChange: BLOCK* addStoredBlock: block
> blk_-9223372036578646784_59354864 moved to storageType DISK on node
> datanode12:50010{code}
> The above logs print the same storedBlock two times. After diving into logs,
> I found that there exist two blocks of a same block group due to some unknown
> reasons. And one of the two blocks is also exists in other datanode. But fsck
> did not print the duplicated replicas info.
> additional information: the file is 3MB+, we use RS-6-3-1024K, so the fsck
> only print seven blocks information. But indeed, we have eight blocks and one
> of them is a duplicated block.
>
> The reason why print above logs is that:
> In BlockManager#addStoredBlock method, because a datanode has two blocks of
> the same block group, the AddBlockResult would be REPLACED.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]