Steven Rand created HDFS-15114:
----------------------------------

             Summary: JournalNodes' committed-txid file includes aborted 
transaction, breaks NameNode startup
                 Key: HDFS-15114
                 URL: https://issues.apache.org/jira/browse/HDFS-15114
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: journal-node, namenode
    Affects Versions: 3.2.1
            Reporter: Steven Rand


A couple of days ago, our active NameNode in an HA setup aborted a 
{{QuorumOutputStream}} starting at tx 3389424 because tx 3389425 failed to be 
written. This was likely related to a rolling restart of the three JournalNodes 
that was in progress at the time. The NameNode logged:
{code:java}
2020-01-11 02:00:50,229 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 
10.6.1.181
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Rolling edit logs
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Ending log segment 3389424, 3389424
2020-01-11 02:00:50,229 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Number of transactions: 2 Total time for transactions(ms): 1 Number of 
transactions batched in Syncs:
 0 Number of syncs: 1 SyncTimes(ms): 1 7
2020-01-11 02:00:50,245 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Remote journal 
10.6.2.187:8485 failed to write txns 3389425-3389425. Will try to write to this 
JN again after the next log roll.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.qjournal.protocol.JournalOutOfSyncException):
 Can't write, no segment open ; journal id: <journal_name>
at org.apache.hadoop.hdfs.qjournal.server.Journal.checkSync(Journal.java:545)
... rest of stacktrace ...

// the same warning for the second JournalNode
// the same warning for the third JournalNode

2020-01-11 02:00:50,246 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Error: flush failed for required journal (JournalAndStream(mgr=QJM to 
[10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream 
starting at txid 3389424))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions 
to achieve quorum size 2/3. 3 exceptions thrown:

// the same "Can't write, no segment open ; journal id: <journal_name>" error 
for all 3 JournalNodes

2020-01-11 02:00:50,246 WARN 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Aborting 
QuorumOutputStream starting at txid 3389424
2020-01-11 02:00:50,255 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1: Error: flush failed for required journal (JournalAndStream(mgr=QJM to 
[10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], stream=QuorumOutputStream 
starting at txid 3389424))
{code}
Even though the stream was aborted, the {{committed-txid}} file on each of the 
three JournalNodes was updated to be {{3389424}}.

This caused both NameNodes to fail to start with this error:
{code:java}
2020-01-11 02:54:35,483 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required 
for active state
2020-01-11 02:54:35,491 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery 
process for unclosed journal segments...
2020-01-11 02:54:35,537 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully 
started new epoch 80
2020-01-11 02:54:35,537 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery 
of unclosed segment starting at txid 3389422
2020-01-11 02:54:35,574 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare 
phase complete. Responses:
10.6.1.4:8485: segmentState { startTxId: 3389422 endTxId: 3389423 isInProgress: 
false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
10.6.2.187:8485: segmentState { startTxId: 3389422 endTxId: 3389423 
isInProgress: false } lastWriterEpoch: 57 lastCommittedTxId: 3389424
2020-01-11 02:54:35,575 INFO 
org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Using longest log: 
10.6.1.4:8485=segmentState {
  startTxId: 3389422
  endTxId: 3389423
  isInProgress: false
}
lastWriterEpoch: 57
lastCommittedTxId: 3389424

2020-01-11 02:54:35,575 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: 
Error: recoverUnfinalizedSegments failed for required journal 
(JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], 
stream=null))
java.lang.AssertionError: Decided to synchronize log to startTxId: 3389422
endTxId: 3389423
isInProgress: false
 but logger 10.6.1.4:8485 had seen txid 3389424 committed
        ... rest of stacktrace ...
2020-01-11 02:54:35,577 INFO org.apache.hadoop.util.ExitUtil: Exiting with 
status 1: Error: recoverUnfinalizedSegments failed for required journal 
(JournalAndStream(mgr=QJM to [10.6.1.4:8485, 10.6.1.181:8485, 10.6.2.187:8485], 
stream=null))
{code}
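For reference, the {{AssertionError}} above appears to come from a sanity check during segment recovery (in {{QuorumJournalManager#recoverUnclosedSegment}}): after choosing the longest log to sync, the recovering NameNode verifies that no JournalNode has reported a committed txid beyond the end of that log. Below is a simplified, illustrative sketch of that check applied to the state from the logs above; the class and type names are stand-ins I made up, not the actual Hadoop source:
{code:java}
// Simplified illustration of the recovery-time sanity check that produced the
// AssertionError above. All names/types here are stand-ins, not real Hadoop classes.
import java.util.Map;

public class RecoverySanityCheckSketch {

  /** Minimal stand-in for a JournalNode's prepareRecovery response. */
  static class PrepareRecoveryResponse {
    final long segmentEndTxId;     // end of the best segment this JN has
    final long lastCommittedTxId;  // value of its committed-txid file
    PrepareRecoveryResponse(long segmentEndTxId, long lastCommittedTxId) {
      this.segmentEndTxId = segmentEndTxId;
      this.lastCommittedTxId = lastCommittedTxId;
    }
  }

  /**
   * The recovering NameNode picks the longest log (ending at logToSyncEndTxId)
   * and then checks that no JournalNode claims a committed txid beyond it.
   */
  static void checkNoLoggerAheadOfSyncLog(
      long logToSyncEndTxId, Map<String, PrepareRecoveryResponse> responses) {
    for (Map.Entry<String, PrepareRecoveryResponse> e : responses.entrySet()) {
      if (e.getValue().lastCommittedTxId > logToSyncEndTxId) {
        throw new AssertionError("Decided to synchronize log to endTxId: "
            + logToSyncEndTxId + " but logger " + e.getKey()
            + " had seen txid " + e.getValue().lastCommittedTxId + " committed");
      }
    }
  }

  public static void main(String[] args) {
    // State from the logs above: the longest recoverable segment ends at
    // 3389423, but both responding JournalNodes report committed-txid 3389424.
    Map<String, PrepareRecoveryResponse> responses = Map.of(
        "10.6.1.4:8485", new PrepareRecoveryResponse(3389423L, 3389424L),
        "10.6.2.187:8485", new PrepareRecoveryResponse(3389423L, 3389424L));
    try {
      checkNoLoggerAheadOfSyncLog(3389423L, responses);
    } catch (AssertionError e) {
      System.out.println(e.getMessage());  // matches the startup failure above
    }
  }
}
{code}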
A potentially relevant detail is that each of the three JournalNodes logged 
this at around the time of the aborted stream:
{code:java}
2020-01-11 02:00:32,649 INFO org.apache.hadoop.hdfs.qjournal.server.Journal: 
Latest log is 
EditLogFile(file=<path_to_edits>/<journal_name>/current/edits_inprogress_0000000000003389424,first=0000000000003389424,last=0000000000003389424,inProgress=true,hasCorruptHeader=false)
 ; journal id: <journal_name>
{code}
That file was later renamed to {{edits_inprogress_0000000000003389424.stale}}, 
and the JournalNodes no longer consider it to be the most recent edit file. 
Instead, they now log that {{edits_0000000000003389422-0000000000003389423}} is 
the most recent file they know about.

So what seems to have happened is:

1. The active NameNode tries to write transactions 3389424 and 3389425 to the journal.
2. Tx 3389424 is written successfully, and the {{committed-txid}} file is updated to reflect this.
3. Tx 3389425 fails to be written, so the NameNode aborts the {{QuorumOutputStream}} that contained 3389424 and 3389425.
4. Because the stream was aborted, the JournalNodes rename {{edits_inprogress_0000000000003389424}} to {{edits_inprogress_0000000000003389424.stale}} and forget about tx 3389424.
5. The NameNodes fail to start because the latest committed tx according to the JournalNodes (which comes from the {{committed-txid}} file) is in the future, which (understandably) is an illegal state.

I believe that the NameNodes and the JournalNodes are in a consistent state, 
and that there's no corruption or data loss or anything like that. I think the 
problem is just that the {{committed-txid}} file was updated prematurely, and 
contains an incorrect value.
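
To make the suspected sequence concrete, here is a toy model of the five steps above (all names are invented for illustration and don't correspond to the real {{Journal}}/QJM classes): the {{committed-txid}} value advances when tx 3389424 is acknowledged, but it is never rolled back when the in-progress segment containing that transaction is discarded, so recovery later sees a committed txid beyond the longest surviving log:
{code:java}
// Toy model of the sequence described above; all names are invented for
// illustration and do not correspond to the actual Journal/QJM classes.
public class CommittedTxIdSketch {

  // State of one JournalNode, reduced to the pieces relevant here.
  long committedTxId = 3389423L;      // contents of the committed-txid file
  long lastFinalizedTxId = 3389423L;  // end of edits_...3389422-...3389423
  Long inProgressLastTxId = null;     // last txid in edits_inprogress_*, if any

  // Step 2: tx 3389424 is written to the in-progress segment and acknowledged,
  // so committed-txid advances to 3389424.
  void writeAndCommit(long txid) {
    inProgressLastTxId = txid;
    committedTxId = Math.max(committedTxId, txid);
  }

  // Step 4: the stream is aborted, the in-progress segment is renamed to
  // *.stale and its transactions are forgotten -- but committed-txid is not
  // rolled back.
  void abortInProgressSegment() {
    inProgressLastTxId = null;
  }

  // Step 5: recovery picks the longest surviving log, which now ends at
  // lastFinalizedTxId; committed-txid is "in the future", so startup aborts.
  boolean recoveryIsConsistent() {
    long longestLogEnd =
        inProgressLastTxId != null ? inProgressLastTxId : lastFinalizedTxId;
    return committedTxId <= longestLogEnd;
  }

  public static void main(String[] args) {
    CommittedTxIdSketch jn = new CommittedTxIdSketch();
    jn.writeAndCommit(3389424L);   // step 2
    // step 3: tx 3389425 fails to be written, so the NameNode aborts the stream
    jn.abortInProgressSegment();   // step 4
    System.out.println("recovery consistent? " + jn.recoveryIsConsistent());
  }
}
{code}
Running this prints {{recovery consistent? false}}, mirroring the startup failure above.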

For this particular cluster, I ran {{./service/bin/hdfs namenode 
-initializeSharedEdits}} to format the JournalNodes, but I backed up the edits 
directory first in case there's more information in there that's helpful for 
debugging.
