[ 
https://issues.apache.org/jira/browse/HDFS-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164966#comment-13164966
 ] 

Todd Lipcon commented on HDFS-2634:
-----------------------------------

There are a couple of other edit-log-related issues I intend to address in this 
JIRA:
- when the NN starts up in Standby mode, it currently calls 
{{recoverUnclosedStreams}} on the shared directory. This renames the open file 
that the active NN is writing to, which is horribly incorrect.
- when the NN starts up in Standby mode, it currently reads the in-progress 
logs. An active NN needs to do this to have the most up-to-date namespace, but 
on the SBN it causes a problem: since it has read only _part_ of the 
in-progress log, the tailer keeps calling {{selectInputStreams}} with a txid 
that is in the middle of the log segment. This triggers all sorts of invariant 
checks in the edit logging code and prevents the SBN from making any progress. 
It also causes double-application of these edits when the segment becomes 
finalized.

The fixes are fairly straightforward:
- never allow recovering unclosed streams unless you have the edit logs open 
for write
- don't replay in-progress logs at startup unless you are active
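
As a rough illustration of these two fixes, here is a minimal sketch. The class, 
the state enum, and the helper names are hypothetical stand-ins, not the actual 
edit log API, and the segment names are only illustrative. It shows recovery 
being refused unless the log is open for write, and in-progress segments being 
skipped at startup unless the NN is active.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the NN edit log; not the real HDFS classes. */
class EditLogSketch {
    /** Assumed states: only a log opened for write owns the in-progress files. */
    enum State { CLOSED, OPEN_FOR_READING, OPEN_FOR_WRITING }

    private final State state;

    EditLogSketch(State state) {
        this.state = state;
    }

    /** Fix 1: recovery renames in-progress files, so only a writer may do it. */
    void recoverUnclosedStreams() {
        if (state != State.OPEN_FOR_WRITING) {
            throw new IllegalStateException(
                "cannot recover unclosed streams in state " + state);
        }
        // A real implementation would finalize or discard in-progress segments here.
    }

    /** Fix 2: only an active NN replays in-progress segments at startup. */
    static List<String> selectSegmentsForStartup(List<String> segments,
                                                 boolean isActive) {
        List<String> selected = new ArrayList<>();
        for (String name : segments) {
            // Assumed naming convention for this sketch only.
            boolean inProgress = name.startsWith("edits_inprogress_");
            if (inProgress && !isActive) {
                continue; // the standby leaves the active's open segment alone
            }
            selected.add(name);
        }
        return selected;
    }
}
```

With this shape, a standby that opens the log read-only fails fast if it ever 
reaches the recovery path, instead of renaming the file the active is writing.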

In the process of investigating, I added some more safety checks throughout the 
code so that it's harder to get into the incorrect state shown above, even in 
the presence of bugs.
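
One possible shape for such a safety check, sketched under assumptions: the 
{{Segment}} type and method names below are hypothetical and not taken from the 
patch. The idea is to reject a request to start reading at a txid that falls 
strictly inside a segment, which is the situation the tailer got into above.

```java
/** Hypothetical boundary check; the names are illustrative, not from the patch. */
class SegmentBoundaryCheck {
    /** A log segment covering transactions firstTxId..lastTxId, inclusive. */
    static final class Segment {
        final long firstTxId;
        final long lastTxId;

        Segment(long firstTxId, long lastTxId) {
            if (firstTxId > lastTxId) {
                throw new IllegalArgumentException("empty txid range");
            }
            this.firstTxId = firstTxId;
            this.lastTxId = lastTxId;
        }
    }

    /** True if fromTxId lies inside the segment but after its first txid. */
    static boolean startsMidSegment(Segment s, long fromTxId) {
        return fromTxId > s.firstTxId && fromTxId <= s.lastTxId;
    }

    /** Fails fast rather than re-reading edits that may already be applied. */
    static void checkAligned(Segment s, long fromTxId) {
        if (startsMidSegment(s, fromTxId)) {
            throw new IllegalStateException("txid " + fromTxId
                + " is in the middle of segment [" + s.firstTxId
                + ", " + s.lastTxId + "]");
        }
    }
}
```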
                
> Standby needs to ingest latest edit logs before transitioning to active
> ------------------------------------------------------------------------
>
>                 Key: HDFS-2634
>                 URL: https://issues.apache.org/jira/browse/HDFS-2634
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: ha, name-node
>    Affects Versions: HA branch (HDFS-1623)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>
> When the standby transitions to active state, it needs to _read_ the latest 
> edit logs before it reopens them for write access. Currently, the transition 
> calls {{stopStandbyServices}}, which stops the tailer, but doesn't read ahead 
> to the very end. This ends up leaving the shared edits dir in an inconsistent 
> state where we have overlapping transaction IDs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
