[ https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948675#comment-17948675 ]
ASF GitHub Bot commented on HDFS-17710:
---------------------------------------

hadoop-yetus commented on PR #7296:
URL: https://github.com/apache/hadoop/pull/7296#issuecomment-2845025341

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 57s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 36m 5s | | trunk passed |
| +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 1m 12s | | trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | checkstyle | 1m 9s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 16s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 42s | | trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| +1 :green_heart: | spotbugs | 3m 11s | | trunk passed |
| +1 :green_heart: | shadedclient | 36m 43s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| -1 :x: | mvninstall | 0m 59s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| -1 :x: | compile | 1m 11s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04. |
| -1 :x: | javac | 1m 11s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04. |
| -1 :x: | compile | 1m 3s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06.txt) | hadoop-hdfs in the patch failed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06. |
| -1 :x: | javac | 1m 3s | [/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-compile-hadoop-hdfs-project_hadoop-hdfs-jdkPrivateBuild-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06.txt) | hadoop-hdfs in the patch failed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06. |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 56s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 5 unchanged - 0 fixed = 7 total (was 5) |
| -1 :x: | mvnsite | 1m 2s | [/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-mvnsite-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 32s | | the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| -1 :x: | spotbugs | 1m 1s | [/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-spotbugs-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| -1 :x: | shadedclient | 15m 11s | | patch has errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 1m 5s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. |
| +1 :green_heart: | asflicense | 0m 35s | | The patch does not generate ASF License warnings. |
| | | 104m 3s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.49 ServerAPI=1.49 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7296 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux d784abc13dea 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 4682ddfd4d3ba58527d5e8c001c40a79b9b12376 |
| Default Java | Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/testReport/ |
| Max. process+thread count | 552 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/3/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
> Standby node can load unpersisted edit from JournalNode cache
> -------------------------------------------------------------
>
>                 Key: HDFS-17710
>                 URL: https://issues.apache.org/jira/browse/HDFS-17710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node
>    Affects Versions: 3.4.1
>            Reporter: Adam Binford
>            Priority: Major
>              Labels: pull-request-available
>
> A standby or observer node can load edits from the JournalNodes that failed to be durably persisted. This can cause the standby or observer node to incorrectly think that the last committed transaction ID is higher than it actually is. This is the scenario that led us to find this:
>
> We have three NameNodes: NN1, NN2, and NN3. NN1 is active, NN2 is standby, and NN3 is observer. NN2 was failing to upload fsimage checkpoints to the other NameNodes, for reasons we are still investigating. Because a checkpoint could never be fully created, the JournalNodes could never clean up old edit files. This led all three of our JournalNodes to slowly fill up and eventually run out of disk space, and because all the JournalNodes store effectively the same data, they all filled up at nearly the same time. Since the JournalNodes could no longer write new transactions, NN1 and NN2 both entered restart loops: as soon as either finished booting up and left safe mode, and the ZKFC made it active, it crashed because it was unable to persist new transactions. NN3 stayed up in observer mode the whole time, never crashing because it never tried to write new transactions. Because they are just on VMs, we simply increased the disk size of the JournalNodes to get them functioning again. At that point NN1 and NN2 were still in the process of booting up, so we put NN3 into standby mode so that the ZKFC could make it active right away, getting our system back online. After this, NN1 and NN2 failed to boot up due to a missing edits file on the JournalNodes.
>
> We believe this all stems from the fact that transactions are added to the edit cache on the JournalNodes [before they are persisted to disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433]. We think what happened is something like:
> * Before the disks filled up, NN1 successfully committed transaction 0096 to the JournalNodes.
> * NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. These transactions were added to the edit cache but failed to persist to disk because the disk was full. The write failed on NN1, which crashed and restarted. NN2 then became active and entered the same crash-and-restart loop.
> * NN3 was tailing the edits, and the JournalNodes all returned transactions 0097 and 0098 from the edit cache. Because of this, NN3 thought that transactions up through 0098 had been durably persisted.
> * Disk sizes were increased and the JournalNodes could write transactions again.
> * NN3 became active, thought that transactions up through 0098 had been committed, and began writing new transactions starting at 0099; the JournalNodes updated their committed transaction ID to 0099.
> * No JournalNode actually had transactions 0097 and 0098 written to disk, so when NN1 and NN2 started up, they failed to load edits: the journals claimed to hold edits up through transaction 0099, but no file containing those transactions could be found.
> I had to manually delete all edits files associated with any transaction >= 0099, and manually edit the committed-txn file back to 0096, to finally get all the NameNodes to boot back up to a consistent state.
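The root cause quoted above (edits entering the JournalNode's in-memory cache before they are fsynced) can be made concrete with a small sketch. The following is a minimal, self-contained Java model, not the real `org.apache.hadoop.hdfs.qjournal.server.Journal` API; every name in it (`ToyJournal`, `appendEditCacheFirst`, `getCachedEditsForTailing`, and so on) is hypothetical, and the two mitigations shown are illustrative options under these assumptions, not necessarily what PR #7296 implements.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/**
 * Toy model of a JournalNode's edit path, illustrating the ordering hazard
 * described in the report. NOT the real Hadoop Journal API.
 */
class ToyJournal {
  /** In-memory cache served to tailing standby/observer NameNodes. */
  private final TreeMap<Long, byte[]> editCache = new TreeMap<>();
  /** Highest transaction id known to be durably on disk. */
  private long highestSyncedTxId;

  /** Stand-in for write + fsync; throws IOException when the disk is full. */
  private void writeAndSync(long txid, byte[] edit) throws IOException {
    // ... append to the current edit log segment and fsync ...
  }

  /**
   * Hazardous ordering (what the report describes): the edit becomes
   * visible to tailers before it is durable. If writeAndSync throws,
   * e.g. on a full disk, txids like 0097/0098 remain in the cache anyway.
   */
  void appendEditCacheFirst(long txid, byte[] edit) throws IOException {
    editCache.put(txid, edit); // visible to tailers immediately
    writeAndSync(txid, edit);  // may fail afterwards, leaving a stale entry
  }

  /**
   * One possible mitigation: cache only after the sync succeeds, so a
   * failed write can never be served to a tailing NameNode.
   */
  void appendEditSyncFirst(long txid, byte[] edit) throws IOException {
    writeAndSync(txid, edit);  // throws before the cache is touched
    editCache.put(txid, edit);
    highestSyncedTxId = txid;
  }

  /**
   * Another possible mitigation, on the read side: whatever sits in the
   * cache, never hand out a transaction beyond the highest synced txid.
   */
  List<byte[]> getCachedEditsForTailing(long fromTxId) {
    List<byte[]> result = new ArrayList<>();
    for (Map.Entry<Long, byte[]> e
        : editCache.tailMap(fromTxId, true).entrySet()) {
      if (e.getKey() > highestSyncedTxId) {
        break; // unpersisted edits are not served
      }
      result.add(e.getValue());
    }
    return result;
  }
}
```

Under this toy model, with transaction 0096 synced and 0097-0098 only cached, the read-side guard returns nothing past 0096 to a tailing reader, so an observer like NN3 could never have concluded that 0098 was durable.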