[
https://issues.apache.org/jira/browse/HDFS-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13458208#comment-13458208
]
Todd Lipcon commented on HDFS-3077:
-----------------------------------
>> given we already have the journal daemons, it's trivial to generate unique
>> increasing sequence IDs
> But may still be unnecessary. May be during the code review I might find
> indeed it is trivial.
This code has been committed on the branch for about 2 months, and the relevant
patch was first posted on this JIRA on April 2nd. I think it's a bit late to
consider so fundamental a restructuring now.
bq. In this case, you have a leader/active (to put it loosely) elected at zk,
and then the active has to establish an epoch at znodes to become primary. Both
of these need to be complete before an active becomes functional. Given the "two
things" that need to happen, is a situation possible where one NN is active at
zk while not the primary at the journal nodes, and the other NN is not active at
zk while it is the primary at the journal nodes?
No, this is not possible, since NNs don't try to "re-acquire writer status"
(i.e., start a new epoch) once they've lost it. So, even if a node thinks it is
active, if another node is _actually_ active, the first node will fail the next
time it tries to write. This will cause it to abort, regardless of whether ZK
has told it to be active or not.
I think this is clearer to explain with a couple of examples:
Example 1: manual failover (simplest case, doesn't depend on ZK at all)
1. NN1 is active. NN2 is standby.
2. Admin issues a "failover" command, but for some reason the admin is
partitioned from NN1. So, NN1 remains in Active mode, while NN2 also enters
active mode.
3. NN2, upon entering active mode, starts a new epoch on the JournalNodes.
4. NN1, upon the next time it tries to perform a write, gets back an exception
from a quorum of nodes that its epoch is too old. Since it could not logSync()
and the shared edits dir is marked "required", it aborts.
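To make the fencing in Example 1 concrete, here is a rough toy model in Python. It is only a sketch of the idea and not the actual QuorumJournalManager code: the class and method names (JournalNode, Writer, become_active, log_sync) are invented for illustration, and there is no RPC, persistence, or recovery.

```python
class EpochTooOldError(Exception):
    """Raised by a toy JournalNode when a writer's epoch has been superseded."""


class JournalNode:
    """Hypothetical stand-in for a JournalNode: remembers the highest epoch seen."""

    def __init__(self):
        self.promised_epoch = 0
        self.edits = []

    def new_epoch(self, epoch):
        if epoch <= self.promised_epoch:
            raise EpochTooOldError(epoch)
        self.promised_epoch = epoch

    def journal(self, epoch, edit):
        # Writes from any epoch older than the promised one are rejected.
        if epoch < self.promised_epoch:
            raise EpochTooOldError(epoch)
        self.edits.append(edit)


class Writer:
    """Hypothetical stand-in for a NameNode writing to a set of JournalNodes."""

    def __init__(self, name, journal_nodes):
        self.name = name
        self.jns = journal_nodes
        self.epoch = None
        self.aborted = False

    def _quorum(self, acks):
        return acks > len(self.jns) // 2

    def become_active(self):
        # Propose an epoch higher than anything the JNs have promised so far.
        proposal = max(jn.promised_epoch for jn in self.jns) + 1
        acks = 0
        for jn in self.jns:
            try:
                jn.new_epoch(proposal)
                acks += 1
            except EpochTooOldError:
                pass
        if not self._quorum(acks):
            raise RuntimeError(self.name + " could not start a new epoch")
        self.epoch = proposal

    def log_sync(self, edit):
        if self.aborted:
            raise RuntimeError(self.name + " has already aborted")
        acks = 0
        for jn in self.jns:
            try:
                jn.journal(self.epoch, edit)
                acks += 1
            except EpochTooOldError:
                pass
        if not self._quorum(acks):
            # The shared edits dir is "required": abort, and never
            # try to start a new epoch after being fenced.
            self.aborted = True
            return False
        return True


jns = [JournalNode() for _ in range(3)]
nn1 = Writer("NN1", jns)
nn2 = Writer("NN2", jns)

nn1.become_active()                    # step 1: NN1 is active
first_write = nn1.log_sync("edit-1")   # accepted by a quorum
nn2.become_active()                    # steps 2-3: NN2 starts a new epoch
stale_write = nn1.log_sync("edit-2")   # step 4: rejected, NN1 aborts
new_write = nn2.log_sync("edit-2")     # NN2 is now the only writer
```

The point of the sketch is that NN1 never needs to be told it lost the active role: the JNs' memory of the highest epoch is enough to reject its next write.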
Example 2: automatic failover with ZK and network partitions
1. NN1 is active. NN2 is standby.
2. NN1 becomes partitioned from ZooKeeper. Thus, it receives a ZooKeeper
"Disconnected" event. Because "Disconnected" is not the same as "Expired", NN1
does not immediately transition to standby. Instead, it stays in its current
state (active). Because it can still reach the JNs, it can continue writing.
3. NN2 is still connected to ZK, and thus sees that NN1's ephemeral node has
disappeared (after the ZK session timeout elapses). It then transitions itself
to active.
4. NN2, upon becoming active, starts a new epoch at the JournalNodes. As soon
as this happens, the JNs reject any further writes from NN1, so NN1 aborts on
its next write attempt.
Note that in both cases, even though NN1 can still reach a quorum of JNs, it
doesn't try to start a new epoch after it has been fenced.
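The timeline of Example 2 can be walked through with a similarly small sketch. This is a simplified model under stated assumptions, not the real failover controller: the enum and function names are hypothetical, the JournalNodes are collapsed into a single "current epoch" number, and the reaction to "Expired" (go to standby) is my reading of the description above rather than code from the patch.

```python
from enum import Enum


class ZkEvent(Enum):
    DISCONNECTED = "Disconnected"
    EXPIRED = "Expired"


class HAState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"


def on_session_event(state, event):
    """Toy reaction of an NN to an event on its own ZooKeeper session."""
    if event is ZkEvent.EXPIRED:
        return HAState.STANDBY
    # "Disconnected" is not "Expired": stay in the current state.
    return state


def can_write(writer_epoch, jn_epoch):
    """The JNs (collapsed to one number here) accept only the newest epoch."""
    return writer_epoch == jn_epoch


# Step 1: NN1 is active with epoch 1, NN2 is standby.
jn_epoch = 1
nn1_state, nn1_epoch = HAState.ACTIVE, 1
nn2_state, nn2_epoch = HAState.STANDBY, None

# Step 2: NN1 is partitioned from ZK but can still reach the JNs.
nn1_state = on_session_event(nn1_state, ZkEvent.DISCONNECTED)
nn1_can_write_during_partition = can_write(nn1_epoch, jn_epoch)

# Step 3: NN1's ephemeral node disappears; NN2 transitions to active.
nn2_state = HAState.ACTIVE

# Step 4: NN2 starts a new epoch at the JNs.
jn_epoch += 1
nn2_epoch = jn_epoch

# Both NNs now believe they are active, but only NN2 can write.
nn1_can_write_after_failover = can_write(nn1_epoch, jn_epoch)
nn2_can_write_after_failover = can_write(nn2_epoch, jn_epoch)
```

Between steps 3 and 4 both NNs consider themselves active, which is exactly the window the question is about. The ZK state alone does not decide who may write; the epoch at the JNs does.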
Does that address the concern?
> Quorum-based protocol for reading and writing edit logs
> -------------------------------------------------------
>
> Key: HDFS-3077
> URL: https://issues.apache.org/jira/browse/HDFS-3077
> Project: Hadoop HDFS
> Issue Type: New Feature
> Components: ha, name-node
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Fix For: QuorumJournalManager (HDFS-3077)
>
> Attachments: hdfs-3077-partial.txt, hdfs-3077.txt, hdfs-3077.txt,
> hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt, hdfs-3077.txt,
> qjournal-design.pdf, qjournal-design.pdf
>
>
> Currently, one of the weak points of the HA design is that it relies on
> shared storage such as an NFS filer for the shared edit log. One alternative
> that has been proposed is to depend on BookKeeper, a ZooKeeper subproject
> which provides a highly available replicated edit log on commodity hardware.
> This JIRA is to implement another alternative, based on a quorum commit
> protocol, integrated more tightly in HDFS and with the requirements driven
> only by HDFS's needs rather than more generic use cases. More details to
> follow.