[ https://issues.apache.org/jira/browse/HDFS-14557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899685#comment-16899685 ]

Stephen O'Donnell commented on HDFS-14557:
------------------------------------------

[~jojochuang], thanks for the review. You are correct: if the disk is full and 
remains full, the admin will need to step in. However, the Journal aborts when 
it first hits the disk space issue, and the reason for the crash should be 
fairly obvious from the logs. The problem is that after the administrator 
clears some space, the journal will still refuse to start. This patch fixes 
that. In practice, this often happens when the journal disk is shared with 
something else, and that something else consumed the space and later freed it.

I changed the exception message to the following, which is hopefully less 
cryptic:

"No header present in log (value is -1), probably due to disk space issues when 
it was created. The log has no transactions and will be sidelined."

This is how it would look in the logs:

{code}
2019-08-04 21:25:05,647 [main] WARN  namenode.EditLogInputStream 
(EditLogFileInputStream.java:scanEditLog(349)) - Log file 
/Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/TestJournal/current/edits_inprogress_0000000000000000006
 has no valid header
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$LogHeaderCorruptException:
 No header present in log (value is -1), probably due to disk space issues when 
it was created. The log has no transactions and will be sidelined.
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.init(EditLogFileInputStream.java:172)
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.getVersion(EditLogFileInputStream.java:292)
        at 
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:347)
        ...
2019-08-04 21:25:05,649 [main] INFO  server.Journal 
(Journal.java:scanStorageForLatestEdits(232)) - Latest log is 
EditLogFile(file=/Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/TestJournal/current/edits_inprogress_0000000000000000006,first=0000000000000000006,last=-000000000000012345,inProgress=true,hasCorruptHeader=true)
 ; journal id: test-journal
2019-08-04 21:25:05,649 [main] WARN  server.Journal 
(Journal.java:scanStorageForLatestEdits(235)) - Latest log 
EditLogFile(file=/Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/TestJournal/current/edits_inprogress_0000000000000000006,first=0000000000000000006,last=-000000000000012345,inProgress=true,hasCorruptHeader=true)
 has no transactions. moving it aside and looking for previous log ; journal 
id: test-journal
2019-08-04 21:25:05,661 [main] INFO  server.Journal 
(Journal.java:scanStorageForLatestEdits(232)) - Latest log is 
EditLogFile(file=/Users/sodonnell/source/upstream_hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/TestJournal/current/edits_0000000000000000001-0000000000000000005,first=0000000000000000001,last=0000000000000000005,inProgress=false,hasCorruptHeader=false)
 ; journal id: test-journal
{code}
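The check behind that message can be sketched roughly as follows. This is a minimal illustration only, not the actual Hadoop code; the class and method names (`EditLogHeaderCheck`, `hasValidHeader`) are hypothetical. The idea is that an edit log begins with a 4-byte layout version, and a file that was preallocated but never written is all 0xFF bytes, so the first int reads back as -1:

```java
import java.nio.ByteBuffer;

public class EditLogHeaderCheck {
    // A valid edit log starts with a 4-byte layout version (a negative
    // value such as -60). A preallocated-but-unwritten file is all 0xFF
    // bytes, so the first int reads as -1: treat that as "no header".
    static boolean hasValidHeader(byte[] fileBytes) {
        if (fileBytes.length < 4) {
            return false; // too short to contain a header at all
        }
        int logVersion = ByteBuffer.wrap(fileBytes).getInt();
        return logVersion != -1;
    }
}
```

A log that fails this check carries no transactions, so it is safe to sideline it and fall back to the previous finalized segment, which is exactly what the log output above shows.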

I have also added an additional safety check in FSEditLogLoader.scanEditLog() 
in case there is another way this problem can occur, leading to the infinite 
loop seen in this issue. The code will now break out of scanning a log if the 
resync call does not move the file position past the current one.
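The shape of that safety check can be sketched as follows. This is a hedged illustration of the loop-guard idea only, not the patch itself; `ScanLoopGuard`, `scan`, and the `resync` function parameter are hypothetical stand-ins for the scanner's resync-to-next-opcode behaviour:

```java
import java.util.function.LongUnaryOperator;

public class ScanLoopGuard {
    // Scans forward through a log of the given length. `resync` stands in
    // for the recovery step that skips to the next plausible op boundary
    // after a read failure; it returns the new file position.
    static long scan(long startPos, long fileLength, LongUnaryOperator resync) {
        long pos = startPos;
        while (pos < fileLength) {
            long newPos = resync.applyAsLong(pos);
            if (newPos <= pos) {
                // Resync made no forward progress: bail out rather than
                // retrying the same position forever (the infinite loop
                // seen in this issue).
                break;
            }
            pos = newPos;
        }
        return pos;
    }
}
```

With this guard, a resync that cannot advance terminates the scan cleanly instead of spinning on the same offset.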

> JournalNode error: Can't scan a pre-transactional edit log
> ----------------------------------------------------------
>
>                 Key: HDFS-14557
>                 URL: https://issues.apache.org/jira/browse/HDFS-14557
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 2.6.0
>            Reporter: Wei-Chiu Chuang
>            Assignee: Stephen O'Donnell
>            Priority: Major
>         Attachments: HDFS-14557.001.patch, HDFS-14557.002.patch
>
>
> We saw the following error in JournalNodes a few times before.
> {noformat}
> 2016-09-22 12:44:24,505 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: 
> Caught exception after scanning through 0 ops from /data/1/dfs/current/ed
> its_inprogress_0000000000000661942 while determining its valid length. 
> Position was 761856
> java.io.IOException: Can't scan a pre-transactional edit log.
> at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$LegacyReader.scanOp(FSEditLogOp.java:4592)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanNextOp(EditLogFileInputStream.java:245)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.scanEditLog(EditLogFileInputStream.java:355)
> at 
> org.apache.hadoop.hdfs.server.namenode.FileJournalManager$EditLogFile.scanLog(FileJournalManager.java:551)
> at 
> org.apache.hadoop.hdfs.qjournal.server.Journal.scanStorageForLatestEdits(Journal.java:193)
> at org.apache.hadoop.hdfs.qjournal.server.Journal.<init>(Journal.java:153)
> at 
> org.apache.hadoop.hdfs.qjournal.server.JournalNode.getOrCreateJournal(JournalNode.java:90)
> {noformat}
> The edit file was corrupt, and one possible culprit of this error is a full 
> disk. The JournalNode can't recover and must be resynced manually from other 
> JournalNodes. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
