[
https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246819#comment-13246819
]
Todd Lipcon commented on HDFS-2185:
-----------------------------------
bq. I'm not quite sure how it can be guaranteed. NN cannot be aware of who
issues a transition, right?
My plan was to add an enum flag to RPCs such as {{transitionToActive}} and
{{transitionToStandby}} that indicates who sent the request, for example
"CLI_FAILOVER", "ZKFC_FAILOVER", or "FORCE". The force option would be there so
that an admin who *really* knows what they are doing can override the safety
check. Otherwise the haadmin commands can prevent users from accidentally
shooting themselves in the foot.
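To make the idea concrete, here is a rough sketch of how such a flag might gate a transition. Everything in it is hypothetical: the {{RequestSource}} enum name, the {{isAllowed}} helper, and the {{autoFailoverEnabled}} flag are made up for illustration, and the policy shown (manual CLI requests refused while automatic failover is on, ZKFC requests refused while it is off) is only one plausible reading of the safety check, not a settled design.

```java
// Hypothetical sketch only; none of these names exist in the patch.
class TransitionGuard {

    /** Who sent the transition request (values taken from the comment above). */
    enum RequestSource { CLI_FAILOVER, ZKFC_FAILOVER, FORCE }

    /**
     * Decides whether a transition request should be honored.
     *
     * @param source              who issued the request
     * @param autoFailoverEnabled assumed config flag: true if a ZKFC manages this NN
     */
    static boolean isAllowed(RequestSource source, boolean autoFailoverEnabled) {
        switch (source) {
            case FORCE:
                // The admin explicitly overrides the safety check.
                return true;
            case ZKFC_FAILOVER:
                // Assumed policy: ZKFC requests only make sense when auto failover is on.
                return autoFailoverEnabled;
            case CLI_FAILOVER:
                // Assumed policy: refuse manual failover while a ZKFC is in charge.
                return !autoFailoverEnabled;
            default:
                throw new IllegalArgumentException("Unknown source: " + source);
        }
    }
}
```

With a check like this on the NN side, a plain haadmin failover against an auto-failover cluster would be rejected unless the admin passes the force variant.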
bq. I still think it makes sense to ops to have an option to turn on/off auto
failover on-demand. In case of ZKFC issues, we still can have an alternative
way to bypass it. However I'm neither sure it would help ops or confuse them.
That's a good point - it's useful for emergency situations. I think we can solve
this with docs, though -- if you want to stop automatic failovers, you need to
first shut down the standby ZKFCs, then the active ZKFC. If you bring them down
in the other order, it won't break things, but you might get a failover in the
process. I think adding a programmatic way to do this is a future improvement.
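For the docs, the ordering rule could be illustrated with something like the sketch below. It is purely illustrative: the {{Zkfc}} record-like class and the {{shutdownOrder}} helper are invented here, and a real procedure would have to find out which ZKFC is active by some means this sketch does not cover.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of any patch on this jira.
class ZkfcShutdownOrder {

    /** Minimal stand-in for a ZKFC instance: its host and whether its NN is active. */
    static final class Zkfc {
        final String host;
        final boolean active;

        Zkfc(String host, boolean active) {
            this.host = host;
            this.active = active;
        }
    }

    /**
     * Returns hosts in the order their ZKFCs should be stopped:
     * all standbys first, the active one last, so that stopping
     * them does not trigger a failover along the way.
     */
    static List<String> shutdownOrder(List<Zkfc> zkfcs) {
        List<Zkfc> sorted = new ArrayList<>(zkfcs);
        // Standby (0) sorts before active (1); List.sort is stable.
        sorted.sort(Comparator.comparingInt((Zkfc z) -> z.active ? 1 : 0));
        List<String> hosts = new ArrayList<>();
        for (Zkfc z : sorted) {
            hosts.add(z.host);
        }
        return hosts;
    }
}
```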
> HA: HDFS portion of ZK-based FailoverController
> -----------------------------------------------
>
> Key: HDFS-2185
> URL: https://issues.apache.org/jira/browse/HDFS-2185
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: auto-failover, ha
> Affects Versions: 0.24.0, 0.23.3
> Reporter: Eli Collins
> Assignee: Todd Lipcon
> Fix For: Auto failover (HDFS-3042)
>
> Attachments: Failover_Controller.jpg, hdfs-2185.txt, hdfs-2185.txt,
> hdfs-2185.txt, hdfs-2185.txt, hdfs-2185.txt, zkfc-design.pdf,
> zkfc-design.pdf, zkfc-design.pdf, zkfc-design.pdf, zkfc-design.tex
>
>
> This jira is for a ZK-based FailoverController daemon. The FailoverController
> is a separate daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs fail-over (standby to active and active to standby transitions)
> * Heartbeats to ensure the liveness
> It should have the same/similar interface as the Linux HA RM to aid
> pluggability.