[
https://issues.apache.org/jira/browse/HDFS-2634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164966#comment-13164966
]
Todd Lipcon commented on HDFS-2634:
-----------------------------------
There are a couple other editlog-related issues I intend to address in this
JIRA:
- when the NN starts up in Standby mode, it's currently calling
recoverUnclosedStreams in the shared directory. This of course renames the open
file that the active NN is writing to, which is horribly incorrect.
- when the NN starts up in Standby mode, it's currently reading the in-progress
logs. This is of course important for an active NN to do in order to have the
most up-to-date namespace, but in the case of the SBN it's causing a problem:
since it has read _part_ of the in-progress log, the tailer keeps calling
{{selectInputStreams}} with a txid that's in the middle of the log segment.
This triggers all sorts of invariant checks in the edit logging code and
prevents the SBN from making any progress. It also causes double-application of
these edits when the segment becomes finalized.
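To make the second problem concrete, here is a toy model in Java. The names
({{SegmentModel}}, {{selectFrom}}) are hypothetical, and the boundary check is
only an assumption about the kind of invariant involved, not the actual check
in the edit logging code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical names, not the HDFS API.
final class SegmentModel {
    final long firstTxId;
    final long lastTxId;

    SegmentModel(long firstTxId, long lastTxId) {
        this.firstTxId = firstTxId;
        this.lastTxId = lastTxId;
    }

    // Returns the finalized segments needed to read from fromTxId onward.
    // Assumed invariant: reading must begin at a segment boundary.
    static List<SegmentModel> selectFrom(List<SegmentModel> finalized, long fromTxId) {
        List<SegmentModel> result = new ArrayList<>();
        for (SegmentModel s : finalized) {
            if (s.lastTxId < fromTxId) {
                continue; // this segment has already been applied in full
            }
            if (s.firstTxId < fromTxId) {
                // The caller has already applied the first part of this segment.
                throw new IllegalStateException(
                    "txid " + fromTxId + " is in the middle of the segment starting at "
                        + s.firstTxId);
            }
            result.add(s);
        }
        return result;
    }
}
```

In this model, an SBN that applied txids 101 through 150 from an in-progress
segment will next ask for txid 151. Once that segment is finalized as 101-200,
the request trips the check, and re-reading the whole segment instead would
apply 101 through 150 a second time.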
The fixes are fairly straightforward:
- never allow recovering unclosed streams unless you have the edit logs open
for write
- don't replay in-progress logs at startup unless you are active
In the process of investigating, I added some more safety checks throughout the
code so that it's harder to get into the incorrect state shown above, even in
the presence of bugs.
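A rough sketch of the two rules above, in Java. This is not the patch itself:
{{EditLogSketch}}, {{LogSegment}}, the {{State}} enum and
{{selectSegmentsForStartup}} are hypothetical names chosen for illustration,
and only {{recoverUnclosedStreams}} echoes a name mentioned in this comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of a log segment; not an HDFS class.
final class LogSegment {
    final long firstTxId;
    final boolean inProgress;

    LogSegment(long firstTxId, boolean inProgress) {
        this.firstTxId = firstTxId;
        this.inProgress = inProgress;
    }
}

// Hypothetical model of the edit log's guards; not the real implementation.
final class EditLogSketch {
    enum State { OPEN_FOR_READING, OPEN_FOR_WRITING }

    private final State state;
    private final List<LogSegment> segments;

    EditLogSketch(State state, List<LogSegment> segments) {
        this.state = state;
        this.segments = segments;
    }

    // Rule 1: only a node that holds the logs open for write may recover
    // (and therefore rename) unclosed segments.
    void recoverUnclosedStreams() {
        if (state != State.OPEN_FOR_WRITING) {
            throw new IllegalStateException(
                "cannot recover unclosed streams unless open for write");
        }
        // Actual recovery is omitted from this sketch.
    }

    // Rule 2: at startup, only an active node replays in-progress segments.
    List<LogSegment> selectSegmentsForStartup(boolean isActive) {
        List<LogSegment> selected = new ArrayList<>();
        for (LogSegment s : segments) {
            if (s.inProgress && !isActive) {
                continue; // the standby leaves in-progress segments alone
            }
            selected.add(s);
        }
        return selected;
    }
}
```

Failing loudly in the first guard, rather than silently skipping recovery, is
one way to get the "harder to get into the incorrect state" property: a caller
that reaches recovery from the standby path is a bug worth surfacing.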
> Standby needs to ingest latest edit logs before transitioning to active
> ------------------------------------------------------------------------
>
> Key: HDFS-2634
> URL: https://issues.apache.org/jira/browse/HDFS-2634
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: ha, name-node
> Affects Versions: HA branch (HDFS-1623)
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Critical
>
> When the standby transitions to active state, it needs to _read_ the latest
> edit logs before it reopens them for write access. Currently, the transition
> calls {{stopStandbyServices}}, which stops the tailer, but doesn't read ahead
> to the very end. This ends up leaving the shared edits dir in an inconsistent
> state where we have overlapping transaction IDs.
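
The ordering problem in the description can be illustrated with a small Java
model. Everything here is hypothetical ({{StandbyTransitionSketch}} and its
methods are not NameNode code); it only shows why writing before catching up
produces overlapping transaction IDs.

```java
// Hypothetical model of a standby about to become active; not NameNode code.
final class StandbyTransitionSketch {
    private long lastAppliedTxId;           // last edit this node has applied
    private final long lastTxIdInSharedDir; // last edit present in the shared dir

    StandbyTransitionSketch(long lastAppliedTxId, long lastTxIdInSharedDir) {
        this.lastAppliedTxId = lastAppliedTxId;
        this.lastTxIdInSharedDir = lastTxIdInSharedDir;
    }

    // Stand-in for reading ahead to the very end of the shared edits.
    void catchUpToLatest() {
        lastAppliedTxId = Math.max(lastAppliedTxId, lastTxIdInSharedDir);
    }

    // The first txid this node would assign once it opens the log for write.
    long firstTxIdToWrite() {
        return lastAppliedTxId + 1;
    }

    // True if a new segment would reuse txids already present in the shared dir.
    boolean wouldOverlap() {
        return firstTxIdToWrite() <= lastTxIdInSharedDir;
    }
}
```

With the standby at txid 150 and the shared dir at 200, opening for write
straight away would start a segment at 151 on top of existing edits; calling
{{catchUpToLatest()}} first moves the starting point to 201.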