[ https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17921309#comment-17921309 ]
ASF GitHub Bot commented on HDFS-17710:
---------------------------------------

hadoop-yetus commented on PR #7296:
URL: https://github.com/apache/hadoop/pull/7296#issuecomment-2615227769

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 17m 42s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 7s | | trunk passed |
| +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 1m 10s | | trunk passed with JDK Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| +1 :green_heart: | checkstyle | 1m 11s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 18s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 12s | | trunk passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 35s | | trunk passed with JDK Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| +1 :green_heart: | spotbugs | 3m 8s | | trunk passed |
| +1 :green_heart: | shadedclient | 35m 57s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 5s | | the patch passed |
| +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 1s | | the patch passed with JDK Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| +1 :green_heart: | javac | 1m 1s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 56s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 5 unchanged - 0 fixed = 7 total (was 5) |
| +1 :green_heart: | mvnsite | 1m 4s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| +1 :green_heart: | spotbugs | 3m 6s | | the patch passed |
| +1 :green_heart: | shadedclient | 36m 19s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 298m 49s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. |
| | | | 448m 46s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.47 ServerAPI=1.47 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/7296 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 0326644bff36 5.15.0-130-generic #140-Ubuntu SMP Wed Dec 18 17:59:53 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / e609064c7e307fbfb4a62c06a1efed0e3934a6b0 |
| Default Java | Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.25+9-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_432-8u432-ga~us1-0ubuntu2~20.04-ga |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/2/testReport/ |
| Max. process+thread count | 3144 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/2/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Standby node can load unpersisted edit from JournalNode cache
> --------------------------------------------------------------
>
>                 Key: HDFS-17710
>                 URL: https://issues.apache.org/jira/browse/HDFS-17710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node
>    Affects Versions: 3.4.1
>            Reporter: Adam Binford
>            Priority: Major
>
> A standby or observer NameNode can load edits from a JournalNode that were
> never durably persisted. This can cause the standby or observer to
> incorrectly believe that the last committed transaction ID is higher than
> it actually is. This is the scenario that led us to find this:
>
> We have three NameNodes: NN1, NN2, and NN3. NN1 is active, NN2 is standby,
> and NN3 is observer. NN2 was failing to upload fsimage checkpoints to the
> other NameNodes, for reasons we are still investigating. Because a
> checkpoint could never be fully created, the JournalNodes could never clean
> up old edit files. This caused all three of our JournalNodes to slowly fill
> up and eventually run out of disk space. Because all the JournalNodes store
> effectively the same things, they all filled up at nearly the same time.
>
> Once the JournalNodes could no longer write new transactions, NN1 and NN2
> both entered restart loops: as soon as one finished booting, left safe
> mode, and was made active by the ZKFC, it crashed because it could not
> persist new transactions. NN3 stayed up in observer mode the whole time,
> never crashing because it never tried to write new transactions.
>
> Because the JournalNodes are just on VMs, we simply increased their disk
> sizes to get them functioning again. NN1 and NN2 were still in the process
> of booting up, so we put NN3 into standby mode so that the ZKFC could make
> it active right away, getting our system back online. After this, NN1 and
> NN2 failed to boot up due to a missing edits file on the JournalNodes.
> We believe this all stems from the fact that transactions are added to the
> edit cache on the JournalNodes [before they are persisted to
> disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
> We think what happened is something like:
> * Before the disks filled up, NN1 successfully committed transaction 0096
> to the JournalNodes.
> * NN1 attempted to write transactions 0097 and 0098 to the JournalNodes.
> These transactions were added to the edit cache but failed to persist to
> disk because the disks were full. The write failed on NN1, which crashed
> and restarted. NN2 then became active and entered the same
> crash-and-restart loop.
> * NN3 was tailing the edits, and the JournalNodes all returned transactions
> 0097 and 0098 from the edit cache. Because of this, NN3 believed that
> transactions up through 0098 had been durably persisted.
> * Disk sizes were increased and the JournalNodes could write transactions
> again.
> * NN3 became active, believed transactions up through 0098 had been
> committed, and began writing new transactions starting at 0099; the
> JournalNodes advanced their committed transaction ID to 0099.
> * No JournalNode actually had transactions 0097 and 0098 written to disk,
> so when NN1 and NN2 started up, they failed to load edits from the
> JournalNodes: the journals claimed transactions up through 0099 existed,
> but no file contained 0097 and 0098.
>
> To finally get all the NameNodes to boot back up to a consistent state, I
> had to manually delete every edits file associated with any transaction
> >= 0099 and manually edit the committed-txn file back to 0096.
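The ordering problem is easiest to see in a toy model. The sketch below is a minimal, self-contained Java illustration of the write path described above; `JournalModel`, `Segment`, `journalBuggy`, and `journalFixed` are hypothetical names, not the actual Hadoop classes (the real logic lives in `Journal#journal()` at the link above). It only shows why publishing an edit to the tailing cache before the fsync succeeds lets a tailing NameNode observe a transaction that is not durable, and how reordering the two steps would avoid that.

```java
import java.io.IOException;
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy model of a JournalNode write path. Illustrative only, not Hadoop code. */
class JournalModel {

  /** Stand-in for the on-disk edit log segment. */
  interface Segment {
    void write(byte[] edit) throws IOException;
    void flushAndSync() throws IOException; // fsync; throws when the disk is full
  }

  /** In-memory cache that tailing standby/observer NameNodes read over RPC. */
  private final NavigableMap<Long, byte[]> editCache = new ConcurrentSkipListMap<>();

  private final Segment segment;

  JournalModel(Segment segment) {
    this.segment = segment;
  }

  /**
   * Ordering that reproduces the bug: the edit becomes visible to tailers
   * before it is durable. If flushAndSync() throws (e.g. on ENOSPC), the
   * cache entry is never rolled back, so a tailing NameNode can still read
   * a transaction that exists on no disk anywhere.
   */
  void journalBuggy(long txid, byte[] edit) throws IOException {
    editCache.put(txid, edit); // published to tailers immediately
    segment.write(edit);
    segment.flushAndSync();    // a failure here leaves a stale cache entry
  }

  /**
   * One possible fix: publish to the cache only after the fsync succeeds,
   * so a tailer can never observe an edit that is not durable.
   */
  void journalFixed(long txid, byte[] edit) throws IOException {
    segment.write(edit);
    segment.flushAndSync();    // if this throws, nothing was exposed
    editCache.put(txid, edit); // safe: the edit is now on disk
  }

  /** What a tailing NameNode effectively asks the JournalNode for. */
  byte[] getCachedEdit(long txid) {
    return editCache.get(txid);
  }
}
```

With the buggy ordering, the timeline above falls out directly: `journalBuggy(97, ...)` and `journalBuggy(98, ...)` throw once the disk is full, yet `getCachedEdit(97)` and `getCachedEdit(98)` still return data, so an observer's tailer advances past 0096 even though nothing after 0096 is durable.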