[
https://issues.apache.org/jira/browse/HDFS-3771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489901#comment-13489901
]
Tsz Wo (Nicholas), SZE commented on HDFS-3771:
----------------------------------------------
It does look like HDFS-2824 should fix this. Thanks.
> Namenode can't restart due to corrupt edit logs, timing issue with shutdown
> and edit log rolling
> ------------------------------------------------------------------------------------------------
>
> Key: HDFS-3771
> URL: https://issues.apache.org/jira/browse/HDFS-3771
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: name-node
> Affects Versions: 0.23.3, 2.0.0-alpha
> Environment: QE, 20 node Federated cluster with 3 NNs and 15 DNs,
> using Kerberos based security
> Reporter: patrick white
> Priority: Critical
>
> Our 0.23.3 nightly HDFS regression suite recently encountered a particularly
> nasty issue that left the cluster's default Namenode unable to restart. This
> was on a 20 node Federated cluster with security. The cause appears to be that
> the NN was just starting to roll its edit log when a shutdown occurred. The
> shutdown was intentional, to restart the cluster as part of an automated
> test.
> The tests that were running do not appear to be the issue in themselves. The
> cluster was just wrapping up an adminReport subset, and this failure case has
> not reproduced so far, nor was it failing previously. It looks like a chance
> occurrence of sending the shutdown just as the edit log roll began.
> From the NN log, the following sequence is noted:
> 1. an InvalidateBlocks operation had completed
> 2. FSNamesystem: Roll Edit Log from [Secondary Namenode IPaddr]
> 3. FSEditLog: Ending log segment 23963
> 4. FSEditLog: Starting log segment at 23967
> 5. NameNode: SHUTDOWN_MSG
> => the NN shuts down and then is restarted...
> 6. FSImageTransactionalStorageInspector: Logs beginning at txid 23967 were
> are all in-progress
> 7. FSImageTransactionalStorageInspector: Marking log at
> /grid/[PATH]/edits_inprogress_0000000000000023967 as corrupt since it has no
> transactions in it.
> 8. NameNode: Exception in namenode join
> [main]java.lang.IllegalStateException: No non-corrupt logs for txid 23967
> => NN start attempts continue to cycle trying to restart but can't, failing
> on the same exception due to lack of non-corrupt edit logs
> If these observations are correct and the issue comes from the shutdown
> happening while the edit logs are rolling, does the NN have an equivalent to
> the conventional fs 'sync' blocking action that should be called, or is there
> perhaps a timing hole?
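For illustration only, the restart failure in the quoted log sequence can be modeled with a small sketch. This is not Hadoop code: the class EditLogRecoverySketch, the Segment type, and both selection methods are hypothetical names, and the model assumes a single storage directory. The strict policy mirrors the behavior seen in the log (an empty in-progress segment is marked corrupt and startup fails). The tolerant policy is just one possible alternative, skipping an empty in-progress segment when it is the last one and starts right where the previous segment ended. No claim is made that this matches the change in HDFS-2824.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative model only; none of these types are Hadoop classes.
class EditLogRecoverySketch {

    /** One edit log file: its first txid, how many transactions it holds, and whether it was finalized. */
    static final class Segment {
        final long firstTxId;
        final int numTxns;
        final boolean inProgress;

        Segment(long firstTxId, int numTxns, boolean inProgress) {
            this.firstTxId = firstTxId;
            this.numTxns = numTxns;
            this.inProgress = inProgress;
        }

        // Same rule as in the quoted log: in-progress with no transactions counts as corrupt.
        boolean isCorrupt() {
            return inProgress && numTxns == 0;
        }
    }

    /** Strict policy: any corrupt segment aborts startup. */
    static List<Segment> selectStrict(List<Segment> segments) {
        List<Segment> usable = new ArrayList<>();
        for (Segment s : segments) {
            if (s.isCorrupt()) {
                throw new IllegalStateException("No non-corrupt logs for txid " + s.firstTxId);
            }
            usable.add(s);
        }
        return usable;
    }

    /**
     * Tolerant policy (an assumption, not the actual fix): skip an empty
     * in-progress segment if it is the last one and begins exactly where the
     * previous segment ended, since no transactions can be lost in that case.
     */
    static List<Segment> selectTolerant(List<Segment> segments) {
        List<Segment> usable = new ArrayList<>();
        long nextExpected = -1; // -1 means no earlier segment has been seen yet
        for (int i = 0; i < segments.size(); i++) {
            Segment s = segments.get(i);
            boolean isLast = (i == segments.size() - 1);
            boolean contiguous = (nextExpected < 0) || (s.firstTxId == nextExpected);
            if (s.isCorrupt()) {
                if (isLast && contiguous) {
                    continue; // empty tail segment: safe to ignore
                }
                throw new IllegalStateException("No non-corrupt logs for txid " + s.firstTxId);
            }
            usable.add(s);
            nextExpected = s.firstTxId + s.numTxns;
        }
        return usable;
    }

    public static void main(String[] args) {
        // Finalized segment covering txids 23963..23966, then an empty in-progress one at 23967.
        List<Segment> segments = Arrays.asList(
                new Segment(23963L, 4, false),
                new Segment(23967L, 0, true));
        try {
            selectStrict(segments);
        } catch (IllegalStateException e) {
            System.out.println("strict: " + e.getMessage());
        }
        System.out.println("tolerant: usable segments = " + selectTolerant(segments).size());
    }
}
```

With the two segments from the report, the strict policy fails with the same message as the NN log, while the tolerant policy starts from the single finalized segment. The transaction count of 4 for the finalized segment is inferred from the txids in the log (23963 through 23966) and is an assumption.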