[ https://issues.apache.org/jira/browse/HDFS-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13248621#comment-13248621 ]
Hari Mankude commented on HDFS-3217:
------------------------------------
bq. I disagree. It is an explicit decision to not have the ZKFC act as a service
supervisor, because it adds a lot of complexity. There already exist lots of
solutions for service management - we assume that the user is already using
something like puppet, daemontools, supervisord, cron, etc, to make sure the
daemon restarts eventually.

I did not find a reference to an external monitoring tool in the HA design
docs, so my apologies there. If the scanning interval of the external tool is
significant, it might still make sense for the ZKFC to restart the NN directly.
With one of the NN processes down, the cluster is functioning in a degraded
state, and the longer it takes to restart the standby NN process, the longer
the recovery time is going to be.
> ZKFC should restart NN when healthmonitor gets a SERVICE_NOT_RESPONDING
> exception
> ---------------------------------------------------------------------------------
>
> Key: HDFS-3217
> URL: https://issues.apache.org/jira/browse/HDFS-3217
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Components: auto-failover, ha
> Reporter: Hari Mankude
> Assignee: Hari Mankude
>