[
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834907#comment-17834907
]
ASF GitHub Bot commented on HDFS-17453:
---------------------------------------
hadoop-yetus commented on PR #6708:
URL: https://github.com/apache/hadoop/pull/6708#issuecomment-2042623902
:broken_heart: **-1 overall**
| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 30s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 3 new or modified test files. |
|||| _ trunk Compile Tests _ |
| -1 :x: | mvninstall | 11m 13s | [/branch-mvninstall-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/6/artifact/out/branch-mvninstall-root.txt) | root in trunk failed. |
| +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | compile | 1m 6s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 1m 5s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 16s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 3s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 34s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 23s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 2s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javac | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 10s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | javac | 1m 10s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 58s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 10s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 51s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 |
| +1 :green_heart: | javadoc | 1m 33s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 19s | | the patch passed |
| +1 :green_heart: | shadedclient | 38m 57s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 231m 41s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 0m 47s | | The patch does not generate ASF License warnings. |
| | | 345m 12s | | |
| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.datanode.TestBlockReplacement |
| | hadoop.hdfs.server.namenode.ha.TestHASafeMode |
| | hadoop.hdfs.server.blockmanagement.TestPendingDataNodeMessages |
| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/6708 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 776c2e26b89e 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 181bccb1ca5c0bd6755ae2a03058cd93bd1cb338 |
| Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/6/testReport/ |
| Max. process+thread count | 4432 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
This message was automatically generated.
> IncrementalBlockReport can have race condition with Edit Log Tailer
> -------------------------------------------------------------------
>
> Key: HDFS-17453
> URL: https://issues.apache.org/jira/browse/HDFS-17453
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: auto-failover, ha, hdfs, namenode
> Affects Versions: 3.3.0, 3.3.1, 2.10.2, 3.3.2, 3.3.5, 3.3.4, 3.3.6
> Reporter: Danny Becker
> Assignee: Danny Becker
> Priority: Major
> Labels: pull-request-available
>
> h2. Summary
> There is a race condition between IncrementalBlockReports (IBRs) and the
> EditLogTailer in the Standby NameNode (SNN) which can lead to leaked IBRs and
> false corrupt blocks after an HA failover. The race occurs when the SNN loads
> the edit logs before it receives the corresponding block reports from the
> DataNode (DN).
> h2. Example
> In the following example, there is a block (b1) with three generation stamps
> (gs1, gs2, gs3).
> # SNN1 loads edit logs for b1gs1 and b1gs2.
> # DN1 sends the IBR for b1gs1 to SNN1.
> # SNN1 determines that the reported block b1gs1 from DN1 is corrupt, and
> queues it for later processing:
> [BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
> {code:java}
> BlockToMarkCorrupt c = checkReplicaCorrupt(
>     block, reportedState, storedBlock, ucState, dn);
> if (c != null) {
>   if (shouldPostponeBlocksFromFuture) {
>     // If the block is an out-of-date generation stamp or state,
>     // but we're the standby, we shouldn't treat it as corrupt,
>     // but instead just queue it for later processing.
>     // Storing the reported block for later processing, as that is what
>     // comes from the IBR / FBR and hence what we should use to compare
>     // against the memory state.
>     // See HDFS-6289 and HDFS-15422 for more context.
>     queueReportedBlock(storageInfo, block, reportedState,
>         QUEUE_REASON_CORRUPT_STATE);
>   } else {
>     toCorrupt.add(c);
>   }
>   return storedBlock;
> } {code}
> # DN1 sends IBR for b1gs2 and b1gs3 to SNN1.
> # SNN1 processes b1gs2 and updates the blocks map.
> # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future
> genstamp.
> # SNN1 loads b1gs3 edit logs and processes the queued reports for b1.
> # SNN1 processes b1gs1 first and puts it back in the queue.
> # SNN1 processes b1gs3 next and updates the blocks map.
> # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
> # SNN1 catches up to the latest edit logs, then processes all queued block
> reports as part of becoming the ANN.
> # ANN1 will process b1gs1 and mark it as corrupt.
> If the sequence above happens for every DN that stores b1, then after the HA
> failover b1 will be incorrectly marked as corrupt. The false corruption is
> only fixed when the first DN sends a FullBlockReport or an IBR for b1 (see
> the sketch below).
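>
> As an illustration of one possible direction (a hedged sketch of my own, not
> the actual patch), the genstamp comparison used in the diagnostic logging
> below could serve as a guard when the queued reports are drained: if the
> stored replica from the same DN storage already has a newer generation stamp,
> the queued report is stale and can be dropped instead of re-queued or marked
> corrupt. All identifiers come from the BlockManager snippets in this report.
> {code:java}
> BlockToMarkCorrupt c = checkReplicaCorrupt(
>     block, reportedState, storedBlock, ucState, dn);
> if (c != null) {
>   // Hypothetical guard: look up the stored replica for this same storage.
>   DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
>   if (storageInfo.equals(storedStorageInfo)
>       && storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
>     // The replica already tracked for this storage has a newer genstamp,
>     // so this queued report is stale: drop it rather than re-queueing it
>     // or marking the block corrupt.
>     return storedBlock;
>   }
>   if (shouldPostponeBlocksFromFuture) {
>     queueReportedBlock(storageInfo, block, reportedState,
>         QUEUE_REASON_CORRUPT_STATE);
>   } else {
>     toCorrupt.add(c);
>   }
>   return storedBlock;
> } {code}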
> h2. Logs from Active Cluster
> I added the following logs to confirm this issue in an active cluster:
> {code:java}
> BlockToMarkCorrupt c = checkReplicaCorrupt(
>     block, reportedState, storedBlock, ucState, dn);
> if (c != null) {
>   DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
>   LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from DN {}",
>       block, reportedState.name(), ucState.name(), storageInfo, storedBlock,
>       storedStorageInfo);
>   if (storageInfo.equals(storedStorageInfo) &&
>       storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
>     LOG.info("Stored Block {} from the same DN {} has a newer GenStamp.",
>         storedBlock, storedStorageInfo);
>   }
>   if (shouldPostponeBlocksFromFuture) {
>     // If the block is an out-of-date generation stamp or state,
>     // but we're the standby, we shouldn't treat it as corrupt,
>     // but instead just queue it for later processing.
>     // Storing the reported block for later processing, as that is what
>     // comes from the IBR / FBR and hence what we should use to compare
>     // against the memory state.
>     // See HDFS-6289 and HDFS-15422 for more context.
>     queueReportedBlock(storageInfo, block, reportedState,
>         QUEUE_REASON_CORRUPT_STATE);
>     LOG.info("Queueing the block {} for later processing", block);
>   } else {
>     toCorrupt.add(c);
>     LOG.info("Marking the block {} as corrupt", block);
>   }
>   return storedBlock;
> } {code}
>
> Logs from nn1 (Active):
> {code:java}
> 2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port
> 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634,
> newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010,
> [DN3]:10010], client=client1)"
> 2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port
> 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634
> => blk_66092666802_65700925027) success"
> 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port
> 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027,
> newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010,
> [DN3]:10010], client=client1)"
> 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port
> 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027
> => blk_66092666802_65700933553) success" {code}
>
> Logs from nn2 (Standby):
> {code:java}
> 2024-04-03T03:01:23.067-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from
> DN null"
> 2024-04-03T03:01:23.067-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing
> the block blk_66092666802_65700925027 for later processing"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from
> DN null"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing
> the block blk_66092666802_65700925027 for later processing"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from
> DN null"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report
> processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing
> the block blk_66092666802_65700925027 for later processing" {code}
>
> Logs from nn2 when it transitions to Active:
> {code:java}
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from
> DN [DISK]DS-1:NORMAL:[DN1]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored
> Block blk_66092666802_65700933553 from the same DN
> [DISK]DS-1:NORMAL:[DN1]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the
> block blk_66092666802_65700925027 as corrupt"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from
> DN [DISK]DS-2:NORMAL:[DN2]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored
> Block blk_66092666802_65700933553 from the same DN
> [DISK]DS-2:NORMAL:[DN2]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the
> block blk_66092666802_65700925027 as corrupt"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found
> corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN
> [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from
> DN [DISK]DS-3:NORMAL:[DN3]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored
> Block blk_66092666802_65700933553 from the same DN
> [DISK]DS-3:NORMAL:[DN3]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port
> 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the
> block blk_66092666802_65700925027 as corrupt"
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)