[ https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17948962#comment-17948962 ]

ASF GitHub Bot commented on HDFS-17710:
---------------------------------------

hadoop-yetus commented on PR #7296:
URL: https://github.com/apache/hadoop/pull/7296#issuecomment-2847897424

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 35s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 1 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  35m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 23s |  |  trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   1m 10s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m  9s |  |  trunk passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 39s |  |  trunk passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m  8s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  35m 58s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m  5s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  2s |  |  the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   1m  2s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 56s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 5 unchanged - 0 fixed = 7 total (was 5)  |
   | +1 :green_heart: |  mvnsite  |   1m  6s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 56s |  |  the patch passed with JDK Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 28s |  |  the patch passed with JDK Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m  7s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  37m 42s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 258m 13s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | -1 :x: |  asflicense  |   0m 55s | [/results-asflicense.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/artifact/out/results-asflicense.txt) |  The patch generated 2925 ASF License warnings.  |
   |  |   | 389m 30s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.fs.viewfs.TestViewFileSystemLinkMergeSlash |
   |   | hadoop.fs.viewfs.TestViewFsAtHdfsRoot |
   |   | hadoop.hdfs.TestDatanodeRegistration |
   |   | hadoop.fs.viewfs.TestViewFileSystemAtHdfsRoot |
   |   | hadoop.fs.TestSWebHdfsFileContextMainOperations |
   |   | hadoop.hdfs.server.namenode.TestCacheDirectives |
   |   | hadoop.fs.TestWebHdfsFileContextMainOperations |
   |   | hadoop.hdfs.server.blockmanagement.TestBlockReportLease |
   |   | hadoop.fs.viewfs.TestViewFileSystemLinkRegex |
   |   | hadoop.fs.viewfs.TestViewFileSystemLinkFallback |
   |   | hadoop.fs.viewfs.TestViewFileSystemOverloadSchemeHdfsFileSystemContract |
   |   | hadoop.fs.TestFcHdfsCreateMkdir |
   |   | hadoop.fs.viewfs.TestViewFsHdfs |
   |   | hadoop.fs.viewfs.TestViewFileSystemHdfs |
   |   | hadoop.fs.TestHDFSFileContextMainOperations |
   |   | hadoop.hdfs.server.blockmanagement.TestDatanodeManager |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.49 ServerAPI=1.49 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/7296 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 84e7c8cc2b04 5.15.0-136-generic #147-Ubuntu SMP Sat Mar 15 15:53:30 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 5b1e6ebe9bf7e026df5e6da72b3784dfa1d5492e |
   | Default Java | Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.26+4-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_442-8u442-b06~us1-0ubuntu1~20.04-b06 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/testReport/ |
   | Max. process+thread count | 3643 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-7296/5/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> Standby node can load unpersisted edit from JournalNode cache
> -------------------------------------------------------------
>
>                 Key: HDFS-17710
>                 URL: https://issues.apache.org/jira/browse/HDFS-17710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node
>    Affects Versions: 3.4.1
>            Reporter: Adam Binford
>            Priority: Major
>              Labels: pull-request-available
>
> A standby or observer NameNode can load edits from a JournalNode that were 
> never durably persisted. This can cause the standby or observer node to 
> incorrectly believe that the last committed transaction ID is higher than it 
> actually is. This is the scenario that led us to find this:
> We have three NameNodes: NN1, NN2, and NN3. NN1 is active, NN2 is standby, 
> and NN3 is observer. NN2 was failing to upload fsimage checkpoints to the 
> other NameNodes, for reasons we are still investigating. But because a 
> checkpoint was never able to be fully created, the JournalNodes could never 
> clean up old edit files. This led all three of our JournalNodes to slowly 
> fill up and eventually run out of disk space. Because all the JournalNodes 
> store effectively the same data, they all filled up at nearly the same time.
> Since the JournalNodes could no longer write new transactions, NN1 and NN2 
> both entered restart loops: as soon as either finished booting and left safe 
> mode, the ZKFC made it active, and it then crashed because it could not 
> persist new transactions. NN3 stayed up in observer mode the whole time, 
> never crashing because it never tried to write new transactions.
> Because they are just on VMs, we simply increased the disk size of the 
> JournalNodes to get them functioning again. After this, NN1 and NN2 were 
> still in the process of booting up, so we put NN3 into standby mode so that 
> the ZKFC could make it active right away, getting our system back online. 
> However, NN1 and NN2 then failed to boot up due to a missing edits file on 
> the JournalNodes.
> We believe this all stems from the fact that transactions are added to the 
> edit cache on the JournalNodes [before they are persisted to 
> disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
> We think what happened is something like:
> * Before the disks filled up, NN1 successfully committed transaction 0096 to 
> the JournalNodes.
> * NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. 
> These transactions were added to the edit cache, but then failed to persist 
> to disk because the disk was full. The write failed on NN1, which crashed and 
> restarted. NN2 then became active and entered the same crash-and-restart loop.
> * NN3 was tailing the edits, and the JournalNodes all returned transactions 
> 0097 and 0098 from the edit cache. Because of this, NN3 believed that 
> transactions up to 0098 had been durably persisted.
> * Disk sizes were increased and the JournalNodes were able to write 
> transactions again.
> * NN3 became active, believed that transactions up to 0098 had been 
> committed, and began writing new transactions starting at 0099; the 
> JournalNodes then advanced their committed transaction ID to 0099.
> * No JournalNode actually had transactions 0097 and 0098 written to disk, so 
> when NN1 and NN2 started up, they failed to load edits from the JournalNodes: 
> the journals claimed edits existed through transaction 0099, but no file 
> containing those edits could be found.
> I had to manually delete all edit files associated with any transaction >= 
> 0099, and manually edit the committed-txn file back to 0096, to finally get 
> all the NameNodes to boot back up into a consistent state.
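The ordering hazard described above can be sketched in a few lines. This is a hypothetical, simplified model (the class and method names below are invented for illustration, not Hadoop's actual `Journal` API): a journal that adds a transaction to its in-memory edit cache before the durable write lets a tailing reader observe a transaction that is never persisted if the disk write then fails.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of the cache-before-persist ordering described in the
 * issue. Names are hypothetical; this is not Hadoop's Journal class.
 */
public class JournalOrderingSketch {
    static class Journal {
        final List<Long> editCache = new ArrayList<>();   // served to tailing readers over RPC
        final List<Long> durableLog = new ArrayList<>();  // what survives a crash
        boolean diskFull = false;

        // Mirrors the order flagged in the issue: cache first, then disk.
        void journal(long txId) {
            editCache.add(txId);          // step 1: txn becomes visible to tailers
            if (diskFull) {
                throw new IllegalStateException("No space left on device");
            }
            durableLog.add(txId);         // step 2: durable persist (may never happen)
        }

        long highestCachedTxId()  { return editCache.get(editCache.size() - 1); }
        long highestDurableTxId() { return durableLog.get(durableLog.size() - 1); }
    }

    public static void main(String[] args) {
        Journal jn = new Journal();
        jn.journal(96);                   // txn 0096 committed normally
        jn.diskFull = true;               // JournalNode disk fills up
        for (long tx = 97; tx <= 98; tx++) {
            try {
                jn.journal(tx);           // cached, then fails to persist
            } catch (IllegalStateException e) {
                // in the real scenario the active NN crashes here
            }
        }
        // A tailing observer (NN3's position) sees 0098; disk only has 0096.
        System.out.println("cached=" + jn.highestCachedTxId()
                + " durable=" + jn.highestDurableTxId());   // cached=98 durable=96
    }
}
```

Swapping the two steps (persist first, cache only on success) would close the window: a tailer could then never see a transaction that a crash can erase.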



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
