[ https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HDFS-17710:
----------------------------------
    Labels: pull-request-available  (was: )

> Standby node can load unpersisted edit from JournalNode cache
> -------------------------------------------------------------
>
>                 Key: HDFS-17710
>                 URL: https://issues.apache.org/jira/browse/HDFS-17710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node
>    Affects Versions: 3.4.1
>            Reporter: Adam Binford
>            Priority: Major
>              Labels: pull-request-available
>
> A standby or observer node can load edits from the JournalNodes that failed
> to be durably persisted. This can cause the standby or observer node to
> incorrectly think that the last committed transaction ID is higher than it
> actually is. This is the scenario that led us to find this:
> We have three NameNodes: NN1, NN2, and NN3. NN1 is active, NN2 is standby,
> and NN3 is observer. NN2 had been failing to upload fsimage checkpoints to
> the other NameNodes, for reasons we are still investigating. Because a
> checkpoint was never fully created, the JournalNodes could never clean up
> old edit files. This caused all three of our JournalNodes to slowly fill up
> and eventually run out of disk space. Because all the JournalNodes store
> effectively the same data, they all filled up at nearly the same time.
> Since the JournalNodes could no longer write new transactions, NN1 and NN2
> entered restart loops: as soon as one of them finished booting, left safe
> mode, and was made active by the ZKFC, it crashed after being unable to
> persist new transactions. NN3 stayed up in observer mode the whole time,
> never crashing, as it never tried to write new transactions.
> Because they are just on VMs, we simply increased the disk size of the
> JournalNodes to get them functioning again. After this, NN1 and NN2 were
> still in the process of booting up, so we put NN3 into standby mode so that
> the ZKFC could make it active right away, getting our system back online.
> After this, NN1 and NN2 failed to boot up due to a missing edits file on
> the JournalNodes.
> We believe this all stems from the fact that transactions are added to the
> edit cache on the JournalNodes [before they are persisted to
> disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
> We think what happened is something like:
> * Before the disks filled up, NN1 successfully committed transaction 0096
> to the JournalNodes.
> * NN1 attempted to write transactions 0097 and 0098 to the JournalNodes.
> These transactions were added to the edit cache, but then failed to persist
> to disk because the disk was full. The write failed on NN1, which crashed
> and restarted. NN2 then became active and entered the same crash-and-restart
> loop.
> * NN3 was tailing the edits, and the JournalNodes all returned transactions
> 0097 and 0098 from the edit cache. Because of this, NN3 believed that
> transactions up through 0098 had been durably persisted.
> * The disk sizes were increased and the JournalNodes were able to write
> transactions again.
> * NN3 became active, thought that transactions up through 0098 had been
> committed, and began writing new transactions starting at 0099, and the
> JournalNodes updated their committed transaction ID up to 0099.
> * No JournalNode actually had transactions 0097 and 0098 written to disk,
> so when NN1 and NN2 started up, they failed to load edits from the
> JournalNodes, because the journals claimed to have up through transaction
> 0099 but could not find any file containing those edits.
> I had to manually delete all edits files associated with any transaction >=
> 0099, and manually edit the committed-txn file back to 0096, to finally get
> all the NameNodes back up in a consistent state.
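The cache-before-persist ordering described in the report can be illustrated with a minimal, self-contained Java sketch. The names used here (EditCacheSketch, DiskFullException, readerCanSee) are illustrative assumptions, not Hadoop's actual Journal code; the linked Journal.java shows the real implementation.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch only: a journal that caches an edit before persisting it.
public class EditCacheSketch {
    // In-memory cache of edits keyed by transaction ID, served to tailing readers.
    final ConcurrentSkipListMap<Long, String> cache = new ConcurrentSkipListMap<>();
    boolean diskFull = false; // simulated disk-full condition

    static class DiskFullException extends RuntimeException {}

    // Mirrors the problematic order: cache first, then persist.
    void journal(long txId, String record) {
        cache.put(txId, record);     // step 1: edit becomes visible to tailing readers
        persistToDisk(txId, record); // step 2: may fail when the disk is full
    }

    void persistToDisk(long txId, String record) {
        if (diskFull) {
            throw new DiskFullException(); // the edit never reaches disk
        }
        // (a real implementation would append to the edit log file and sync it)
    }

    // What an observer tailing edits over RPC effectively sees.
    boolean readerCanSee(long txId) {
        return cache.containsKey(txId);
    }
}
```

With this ordering, a persist failure leaves the transaction visible to tailing readers, which is exactly the window the scenario above fell into: 0097 and 0098 were served from the cache even though no JournalNode had them on disk.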
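For contrast, here is a sketch of the opposite ordering, persisting before caching, under the same illustrative assumptions. This demonstrates the general technique of only exposing durably written edits; it is not the actual change in the associated pull request.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch only: persist-before-cache ordering.
public class PersistFirstSketch {
    final ConcurrentSkipListMap<Long, String> cache = new ConcurrentSkipListMap<>();
    boolean diskFull = false; // simulated disk-full condition

    static class DiskFullException extends RuntimeException {}

    void journal(long txId, String record) {
        persistToDisk(txId, record); // persist first; on failure the cache is untouched
        cache.put(txId, record);     // only durably persisted edits become visible
    }

    void persistToDisk(long txId, String record) {
        if (diskFull) {
            throw new DiskFullException();
        }
        // (a real implementation would append to the edit log file and sync it)
    }

    boolean readerCanSee(long txId) {
        return cache.containsKey(txId);
    }
}
```

Under this ordering, a failed write is never visible to a tailing reader, so an observer like NN3 could not come to believe that an unpersisted transaction had been committed.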
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org