[ 
https://issues.apache.org/jira/browse/HDFS-17710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17710:
----------------------------------
    Labels: pull-request-available  (was: )

> Standby node can load unpersisted edit from JournalNode cache
> -------------------------------------------------------------
>
>                 Key: HDFS-17710
>                 URL: https://issues.apache.org/jira/browse/HDFS-17710
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: journal-node
>    Affects Versions: 3.4.1
>            Reporter: Adam Binford
>            Priority: Major
>              Labels: pull-request-available
>
> A standby or observer node can load edits from a JournalNode's cache that 
> were never durably persisted to disk. This can cause the standby or observer 
> node to incorrectly believe that the last committed transaction ID is higher 
> than it actually is. This is the scenario that led us to find this:
> We have three NameNodes, NN1, NN2, and NN3. NN1 is active, NN2 is standby, 
> and NN3 is observer. NN2 was failing to upload fsimage checkpoints to the 
> other NameNodes, for reasons we are still investigating. But because a 
> checkpoint could never be fully created, the JournalNodes could never clean 
> up old edit files. This led all 3 of our JournalNodes to slowly fill up 
> and eventually run out of disk space. Because all the JournalNodes store 
> effectively the same things, they all filled up at nearly the same time.
> Since the JournalNodes could no longer write new transactions, NN1 and NN2 
> both entered restart loops: as soon as they finished booting up, left safe 
> mode, and were made active by the ZKFC, they crashed because they could not 
> persist new transactions. NN3 stayed up in observer mode the whole time, 
> never crashing because it never tried to write new transactions.
> Because they are just on VMs, we simply increased the disk size of the 
> JournalNodes to get them functioning again. After this, NN1 and NN2 were 
> still in the process of booting up, so we put NN3 into standby mode so that 
> the ZKFC could make it active right away, getting our system back online. 
> After this, NN1 and NN2 failed to boot up due to a missing edits file on the 
> JournalNodes.
> We believe this all stems from the fact that transactions are added to the 
> edit cache on the JournalNodes [before they are persisted to 
> disk|https://github.com/apache/hadoop/blob/f38d7072566e88c77e47d1533e4be4c1bd98a06a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/qjournal/server/Journal.java#L433].
>  We think what happened is something like:
> * Before the disks filled up, NN1 successfully committed transaction 0096 to 
> the JournalNodes.
> * NN1 attempted to write transactions 0097 and 0098 to the JournalNodes. 
> These transactions were added to the edit cache, but failed to persist to 
> disk because the disk was full. The write failed on NN1, which crashed and 
> restarted. NN2 then became active and entered the same crash-and-restart loop.
> * NN3 was tailing the edits, and the JournalNodes all returned transactions 
> 0097 and 0098 from the edit cache. Because of this, NN3 believed that 
> transactions up through 0098 had been durably persisted.
> * Disk sizes were increased and the JournalNodes were able to write 
> transactions again.
> * NN3 became active, believing that transactions up through 0098 had been 
> committed, and began writing new transactions starting at 0099; the 
> JournalNodes updated their committed transaction ID up to 0099.
> * No JournalNode actually had transactions 0097 and 0098 written to disk, so 
> when NN1 and NN2 started up, they failed to load edits from the JournalNodes: 
> the journals claimed edits up through transaction 0099 should exist, but no 
> file containing those edits could be found.
> I had to manually delete all edit files associated with any transaction >= 
> 0099, and manually edit the committed-txn file back to 0096, to finally get 
> all the NameNodes back up in a consistent state.
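The cache-before-persist ordering described above can be illustrated with a toy model. This is a simplified sketch, not the real Journal.java code: the `MiniJournal` class, its fields, and the demo class are hypothetical names invented here. It shows only the hazard window, where an edit enters the in-memory cache and is served to a tailing reader even though the disk write then fails.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the hazard (NOT the real Journal.java API):
// edits enter the in-memory cache before the durable write, so a cache
// read can return transactions that were never persisted to disk.
class MiniJournal {
    final List<Long> editCache = new ArrayList<>(); // served to tailing readers
    final List<Long> persisted = new ArrayList<>(); // simulated durable storage
    boolean diskFull = false;

    void journal(long txId) {
        editCache.add(txId);          // 1) cached first -- the bug window
        if (diskFull) {
            throw new IllegalStateException("No space left on device");
        }
        persisted.add(txId);          // 2) only now is the edit durable
    }

    List<Long> tailFromCache() {
        return new ArrayList<>(editCache);
    }
}

public class CacheBeforePersistDemo {
    public static void main(String[] args) {
        MiniJournal jn = new MiniJournal();
        jn.journal(96);               // txn 0096 commits normally
        jn.diskFull = true;           // disks fill up
        try { jn.journal(97); } catch (IllegalStateException e) { /* writer crashes */ }
        try { jn.journal(98); } catch (IllegalStateException e) { /* writer crashes */ }

        // A tailing reader (NN3 in the scenario) still sees 0097/0098...
        System.out.println("cache sees: " + jn.tailFromCache()); // [96, 97, 98]
        // ...but durable storage only ever got 0096.
        System.out.println("disk has:   " + jn.persisted);       // [96]
    }
}
```

Under this model, reversing the two steps (or invalidating the cached entry when the disk write fails) would keep the cache from ever serving an unpersisted transaction.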



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
