[
https://issues.apache.org/jira/browse/HDFS-14557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16896047#comment-16896047
]
Stephen O'Donnell commented on HDFS-14557:
------------------------------------------
I have been able to reproduce this error in a test, but the scenario that
makes it appear is somewhat strange.
Note that the class throwing the "Can't scan a pre-transactional edit log"
error is "FSEditLogOp$LegacyReader". Grepping the source for
"pre-transactional" shows this is the only place the message occurs.
Checking the source, the exception is only thrown when the reader attempts to
load an edit log whose layout version is older than -37:
{code}
@Override
public long scanOp() throws IOException {
  if (!NameNodeLayoutVersion.supports(
      LayoutVersion.Feature.STORED_TXIDS, logVersion)) {
    throw new IOException("Can't scan a pre-transactional edit log.");
  }
  FSEditLogOp op = decodeOp();
  return op == null ?
      HdfsServerConstants.INVALID_TXID : op.getTransactionId();
}
{code}
Where STORED_TXIDS is:
{code}
STORED_TXIDS(-37, "Transaction IDs are stored in edits log and image files"),
{code}
So far as I can tell, the LegacyReader only gets created if the layout version
is older than LayoutVersion.Feature.EDITS_CHECKSUM, which is -28.
Therefore, for the problem to occur, the edits file needs to have a header
(two integers, so 8 bytes), the first of which is the layout version. If that
version is numerically greater than -28 (e.g. -27), it will trigger the
"Can't scan a pre-transactional edit log" exception.
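To make the 8-byte header concrete, here is a minimal standalone sketch using
plain java.io (no Hadoop classes; the second header integer is written as a
placeholder 0 here, purely for illustration) that writes such a header and
reads the layout version back:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class EditLogHeaderSketch {
    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("editsHeader", ".tmp");
        f.deleteOnExit();

        // Write an 8-byte header: the layout version followed by a second
        // int (placeholder 0 here; only the first int matters for the bug).
        try (DataOutputStream out = new DataOutputStream(new FileOutputStream(f))) {
            out.writeInt(-27); // numerically greater than EDITS_CHECKSUM (-28)
            out.writeInt(0);
        }

        // A reader sees layout version -27, so the LegacyReader path is
        // taken, and scanOp() throws because -27 predates STORED_TXIDS (-37).
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            System.out.println("layout version = " + in.readInt());
            System.out.println("file length = " + f.length());
        }
    }
}
```

DataOutputStream writes ints big-endian, which is why the first 4 bytes of
the file decode directly to the layout version.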
If there is nothing else in the edit log, the scanner will loop forever
retrying from the same position. For example, this test reproduces the
problem - note I have explicitly set the version to -27 when creating the
edit log:
{code}
public void testScanCorruptEditLog2() throws Exception {
  Configuration conf = new Configuration();
  File editLog = new File(GenericTestUtils.getTempPath("testCorruptEditLog"));
  EditLogFileOutputStream elos = new EditLogFileOutputStream(conf,
      editLog.getAbsoluteFile(), 8192);
  // elos.create(NameNodeLayoutVersion.CURRENT_LAYOUT_VERSION);
  elos.create(-27);
  elos.setReadyToFlush();
  elos.flushAndSync(false);
  elos.close();
  FSEditLogLoader.EditLogValidation val =
      EditLogFileInputStream.scanEditLog(editLog, 2, false);
}
{code}
This produces logs like:
{code}
2019-07-30 12:17:23,210 [Time-limited test] WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1287)) - After resync, position is 8
2019-07-30 12:17:23,210 [Time-limited test] WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1282)) - Caught exception after scanning through 0 ops from /Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/testCorruptEditLog while determining its valid length. Position was 8
java.io.IOException: Can't scan a pre-transactional edit log.
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:5264)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:261)
    at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.scanEditLog(FSEditLogLoader.java:1278)
    at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:345)
    at org.apache.hadoop.hdfs.server.namenode.TestEditLogFileInputStream.testScanCorruptEditLog2(TestEditLogFileInputStream.java:212)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
    at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.lang.Thread.run(Thread.java:748)
2019-07-30 12:17:23,210 [Time-limited test] WARN namenode.FSImage (FSEditLogLoader.java:scanEditLog(1287)) - After resync, position is 8
... (the same exception and warnings repeat indefinitely)
{code}
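The endless retry can be seen from the shape of the scan loop: the exception
is thrown before any byte of an op is consumed, so each "resync" lands back
on the same position. The following is a simplified standalone sketch of that
behaviour - not the actual FSEditLogLoader code, and bounded only so the demo
terminates instead of looping forever like the JournalNode does:

```java
import java.io.IOException;

public class ScanLoopSketch {
    interface OpScanner {
        long scanOp() throws IOException;
    }

    // The real loop has no iteration bound; maxIters exists only so this
    // demo terminates rather than spinning forever.
    static void scanWithRetries(OpScanner scanner, int maxIters) {
        long position = 8; // stuck just past the 8-byte header
        for (int i = 0; i < maxIters; i++) {
            try {
                scanner.scanOp();
                return; // never reached in the corrupt-header case
            } catch (IOException e) {
                // No bytes were consumed before the throw, so the resync
                // returns to the same position every time.
                System.out.println("After resync, position is " + position);
            }
        }
    }

    public static void main(String[] args) {
        scanWithRetries(() -> {
            throw new IOException("Can't scan a pre-transactional edit log.");
        }, 3);
    }
}
```

Because the header check fails unconditionally for these versions, no retry
can ever make progress; only deleting or repairing the file breaks the cycle.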
We often see this problem occur after a disk space error on the edits
directory, possibly indicating it is caused by a partly written edit. However,
checking the errors from different clusters, the error always mentions
different positions, sometimes over 1MB into the file. This indicates more
data was written to the file than just the header, leading to the question of
how the header's layout version gets set to this unexpected value.
> JournalNode error: Can't scan a pre-transactional edit log
> ----------------------------------------------------------
>
> Key: HDFS-14557
> URL: https://issues.apache.org/jira/browse/HDFS-14557
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: ha
> Affects Versions: 2.6.0
> Reporter: Wei-Chiu Chuang
> Priority: Major
>
> We saw the following error in JournalNodes a few times before.
> {noformat}
> 2016-09-22 12:44:24,505 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/1/dfs/current/edits_inprogress_0000000000000661942 while determining its valid length. Position was 761856
> java.io.IOException: Can't scan a pre-transactional edit log.
>     at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4592)
>     at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
>     at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
>     at org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
>     at org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
>     at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:153)
>     at org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
> {noformat}
> The edit file was corrupt, and one possible culprit of this error is a full
> disk. The JournalNode can't recovered and must be resync manually from other
> JournalNodes.