[ 
https://issues.apache.org/jira/browse/HDFS-14557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16895162#comment-16895162
 ] 

Stephen O'Donnell commented on HDFS-14557:
------------------------------------------

I guess there are two parts to this problem - (1) preventing a bad edit from 
getting written in the first place, and (2) allowing the JN to start up again 
if it does have a corrupt edit file.

Looking at the logs from when the space issue occurred, you can see the JN gets 
stuck in a loop: it tries to read at position 4096, fails, resyncs back to 4096, 
and does that forever, quickly causing the logs to roll off:

{code}
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
After resync, position is 4096
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
Caught exception after scanning through 0 ops from 
/hadoop03/jn/journal/current/edits_inprogress_0000000082469913692 while 
determining its valid length. Position was 4096
java.io.IOException: Can't scan a pre-transactional edit log.
    at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4610)
    at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
    at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
    at 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
    at 
org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:153)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:95)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:104)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
    at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
    at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
After resync, position is 4096
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
Caught exception after scanning through 0 ops from 
/hadoop03/jn/journal/current/edits_inprogress_0000000082469913692 while 
determining its valid length. Position was 4096
java.io.IOException: Can't scan a pre-transactional edit log.
    at 
org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4610)
    at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
    at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
    at 
org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
    at 
org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
    at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:153)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:95)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:104)
    at 
org.apache.hadoop.hdfs.qjournal.server.JournalNodeRpcServer.getEditLogManifest(JournalNodeRpcServer.java:186)
    at 
org.apache.hadoop.hdfs.qjournal.protocolPB.QJournalProtocolServerSideTranslatorPB.getEditLogManifest(QJournalProtocolServerSideTranslatorPB.java:236)
    at 
org.apache.hadoop.hdfs.qjournal.protocol.QJournalProtocolProtos$QJournalProtocolService$2.callBlockingMethod(QJournalProtocolProtos.java:25431)
    at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
After resync, position is 4096
2019-05-08 05:19:10,283 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
Caught exception after scanning through 0 ops from 
/hadoop03/jn/journal/current/edits_inprogress_0000000082469913692 while 
determining its valid length. Position was 4096
{code}

This happens in EditLogFileInputStream.scanEditLog() in this loop:

{code}
      while (true) {
        long txid = HdfsConstants.INVALID_TXID;
        lastPos = in.getPosition();
        try {
          if ((txid = in.scanNextOp()) == HdfsConstants.INVALID_TXID) {
            break;
          }
        } catch (Throwable t) {
          FSImage.LOG.warn("Caught exception after scanning through "
              + numValid + " ops from " + in
              + " while determining its valid length. Position was "
              + lastPos, t);
          in.resync();
          FSImage.LOG.warn("After resync, position is " + in.getPosition());
          continue;
        }
        if (lastTxId == HdfsConstants.INVALID_TXID || txid > lastTxId) {
          lastTxId = txid;
        }
        numValid++;
      }
{code}

I wonder if we should add a check in the exception handler, after in.resync():

{code}
if (in.getPosition() == lastPos) {
  // resync went back to the same bad position, so stop scanning and
  // return an invalid/corrupt status rather than looping forever
}
{code}
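
To make that concrete, here is a small self-contained sketch (plain Java with 
made-up names - StuckStream is only a stand-in for EditLogFileInputStream, not 
the real class) of the same control flow with that check added. The stand-in 
always fails and always resyncs back to the same offset, which is roughly what 
the zero-progress loop above is doing; with the check, the scan gives up 
instead of spinning:

{code}
// Self-contained sketch only: StuckStream is a made-up stand-in for
// EditLogFileInputStream, not real Hadoop code.
public class ScanLoopSketch {

  static class StuckStream {
    private long pos = 4096;
    long getPosition() { return pos; }
    long scanNextOp() throws java.io.IOException {
      // Every attempt fails, as in the corrupt-segment case above.
      throw new java.io.IOException("Can't scan a pre-transactional edit log.");
    }
    void resync() { pos = 4096; } // never advances past the corruption
  }

  public static void main(String[] args) {
    StuckStream in = new StuckStream();
    while (true) {
      long lastPos = in.getPosition();
      try {
        in.scanNextOp();
      } catch (Throwable t) {
        in.resync();
        if (in.getPosition() == lastPos) {
          // Proposed check: resync went back to the same bad position,
          // so stop and treat the segment as corrupt rather than loop forever.
          System.out.println("Giving up at position " + lastPos);
          return;
        }
        continue;
      }
    }
  }
}
{code}

In the real code the "give up" branch would presumably return whatever 
invalid/corrupt validation result scanEditLog() already knows how to produce, 
rather than printing.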

Then, in the higher-level code, we could sideline the bad file by renaming it 
(this is already done for empty files). The JN would sync up with the other JNs 
over time, but it would be missing the very latest edits.
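
For the sideline step, something along these lines might be enough - this is 
just a rough sketch with a made-up helper and a made-up ".corrupt" suffix, not 
existing Hadoop code; the real change would presumably sit next to the existing 
empty-file handling in FileJournalManager:

{code}
// Rough sketch only: hypothetical helper, illustrative ".corrupt" suffix.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class SidelineSketch {
  /**
   * Rename a corrupt in-progress segment out of the way so the JN can start;
   * the missing edits would then be re-synced from the other JNs.
   */
  static Path sidelineCorruptSegment(Path segment) throws IOException {
    Path sidelined = segment.resolveSibling(segment.getFileName() + ".corrupt");
    return Files.move(segment, sidelined);
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Sidelined to " + sidelineCorruptSegment(Paths.get(args[0])));
  }
}
{code}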

> JournalNode error: Can't scan a pre-transactional edit log
> ----------------------------------------------------------
>
>                 Key: HDFS-14557
>                 URL: https://issues.apache.org/jira/browse/HDFS-14557
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.6.0
>            Reporter: Wei-Chiu Chuang
>            Priority: Major
>
> We saw the following error in JournalNodes a few times before.
> {noformat}
> 2016-09-22 12:44:24,505 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Caught exception after scanning through 0 ops from /data/1/dfs/current/edits_inprogress_0000000000000661942 while determining its valid length. Position was 761856
> java.io.IOException: Can't scan a pre-transactional edit log.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4592)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
> at 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:153)
> at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
> {noformat}
> The edit file was corrupt, and one possible culprit of this error is a full 
> disk. The JournalNode can't recover and must be resynced manually from the 
> other JournalNodes. 


