Steven Rand created HDFS-15114:
----------------------------------

             Summary: JournalNodes' committed-txid file includes aborted transaction, breaks NameNode startup
                 Key: HDFS-15114
                 URL: https://issues.apache.org/jira/browse/HDFS-15114
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: journal-node, namenode
    Affects Versions: 3.2.1
            Reporter: Steven Rand
A couple of days ago, our active NameNode in an HA setup aborted a {{QuorumOutputStream}} starting at tx 3389424 because tx 3389425 failed to be written. This was likely related to a rolling restart of the three JournalNodes that was happening at this time. The NameNode logged:

{code:java}
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 10.6.1.181
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 3389424, 3389424
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 1 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 1 7
2020-01-11 02:00:50,245 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 10.6.2.187:8485 failed to write txns 3389425-3389425. Will try to write to this JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException): Can't write, no segment open ; journal id: <journal_name>
        at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:545)
        ... rest of stacktrace ...
// the same warning for the second JournalNode
// the same warning for the third JournalNode
2020-01-11 02:00:50,246 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid 3389424))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 3 exceptions thrown:
// the same "Can't write, no segment open ; journal id: <journal_name>" error for all 3 JournalNodes
2020-01-11 02:00:50,246 WARN org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting QuorumOutputStream starting at txid 3389424
2020-01-11 02:00:50,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream starting at txid 3389424))
{code}

Even though the stream was aborted, the {{committed-txid}} file on each of the three JournalNodes was updated to be {{3389424}}.
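My understanding (hedged; I haven't fully traced the QJM code) is that the writer piggy-backs the highest quorum-committed txid on each journal RPC, and the JournalNode persists that value as part of request validation, independently of whether the batch of edits in the same RPC is actually accepted. A simplified sketch of that ordering, with made-up names (this is not the actual {{Journal}} code):

{code:java}
import java.io.IOException;

// Simplified sketch of why the committed-txid file can run ahead of the durable edits.
// Names are illustrative; this is not the actual org.apache.hadoop.hdfs.qjournal code.
class JournalNodeSketch {
  private long committedTxIdOnDisk;   // what the committed-txid file stores
  private SegmentWriter curSegment;   // null after a restart until a new segment is opened

  // Every writer RPC carries the highest txid the writer believes is quorum-committed.
  synchronized void journal(long writerCommittedTxId, long firstTxnId, byte[] edits)
      throws IOException {
    // The piggy-backed committed txid is persisted during request validation,
    // before we know whether this particular batch of edits will be accepted ...
    committedTxIdOnDisk = Math.max(committedTxIdOnDisk, writerCommittedTxId);

    // ... so if the write is then rejected ("Can't write, no segment open" during the
    // rolling restart) and the stream is aborted, committed-txid can end up pointing
    // at a txid that no finalized segment on this JournalNode contains.
    if (curSegment == null) {
      throw new IOException("Can't write, no segment open");
    }
    curSegment.write(firstTxnId, edits);
  }

  interface SegmentWriter {
    void write(long firstTxnId, byte[] edits);
  }
}
{code}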
The updated {{committed-txid}} caused both NameNodes to fail to start with this error:

{code:java}
2020-01-11 02:54:35,483 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2020-01-11 02:54:35,491 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2020-01-11 02:54:35,537 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 80
2020-01-11 02:54:35,537 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 3389422
2020-01-11 02:54:35,574 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
10.6.1.4:8485: segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
10.6.2.187:8485: segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
2020-01-11 02:54:35,575 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest log: 10.6.1.4:8485=segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
2020-01-11 02:54:35,575 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 3389422 endTxId: 3389423 isInProgress: false but logger 10.6.1.4:8485 had seen txid 3389424 committed
        ... rest of stacktrace ...
2020-01-11 02:54:35,577 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1: Error: recoverUnfinalizedSegments failed for required journal (JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=null))
{code}

A potentially relevant detail is that each of the three JournalNodes logged this at around the time of the aborted stream:

{code:java}
2020-01-11 02:00:32,649 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: Latest log is EditLogFile(file=<path_to_edits>/<journal_name>/current/edits_inprogress_0000000000003389424,first=0000000000003389424,last=0000000000003389424,inProgress=true,hasCorruptHeader=false) ; journal id: <journal_name>
{code}

That file was later renamed to {{edits_inprogress_0000000000003389424.stale}}, and the JournalNodes no longer consider it to be the most recent edit file. Instead, they now log that {{edits_0000000000003389422-0000000000003389423}} is the most recent file they know about.

So what seems to have happened is:

1. The active NameNode tries to write transactions 3389424 and 3389425 to the journal.
2. Tx 3389424 is written successfully, and the {{committed-txid}} file is updated to reflect this.
3. Tx 3389425 fails to be written, so the NameNode aborts the {{QuorumOutputStream}} that contained 3389424 and 3389425.
4. Because the stream was aborted, the JournalNodes rename {{edits_inprogress_0000000000003389424}} to {{edits_inprogress_0000000000003389424.stale}} and forget about tx 3389424.
5. The NameNodes fail to start because the latest committed tx according to the JournalNodes (which comes from the {{committed-txid}} file) is in the future, which (understandably) is an illegal state (a rough sketch of the failing check is at the end of this description).

I believe that the NameNodes and the JournalNodes are in a consistent state, and that there's no corruption or data loss or anything like that. I think the problem is just that the {{committed-txid}} file was updated prematurely, and contains an incorrect value.

For this particular cluster, I ran {{./service/bin/hdfs namenode -initializeSharedEdits}} to format the JournalNodes, but I backed up the edits directory first in case there's more information in there that's helpful for debugging.
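To make step 5 concrete: during startup, {{recoverUnfinalizedSegments}} chooses the longest recoverable segment ({{edits_0000000000003389422-0000000000003389423}}) and then, as far as I can tell from the stack trace, trips a sanity check because a JournalNode reports a lastCommittedTxId (3389424) beyond that segment's end. Roughly (illustrative code using the values from the logs above, not the actual {{QuorumJournalManager}} source):

{code:java}
// Rough illustration of the recovery-time sanity check that aborts NameNode startup.
// Illustrative only; not the actual QuorumJournalManager code.
public class RecoveryCheckSketch {

  static void checkAgainstCommittedTxId(long chosenSegmentEndTxId,
                                        long loggerLastCommittedTxId,
                                        String logger) {
    // The segment chosen for synchronization must cover every txid that any
    // JournalNode has recorded as committed; a committed txid "in the future"
    // violates that invariant.
    if (loggerLastCommittedTxId > chosenSegmentEndTxId) {
      throw new AssertionError("Decided to synchronize log ending at "
          + chosenSegmentEndTxId + " but logger " + logger + " had seen txid "
          + loggerLastCommittedTxId + " committed");
    }
  }

  public static void main(String[] args) {
    // Longest recoverable segment is edits_0000000000003389422-0000000000003389423,
    // but committed-txid on each JN still says 3389424 (the aborted transaction).
    checkAgainstCommittedTxId(3389423L, 3389424L, "10.6.1.4:8485");  // throws AssertionError
  }
}
{code}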