[
https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allen Wittenauer updated HDFS-7400:
-----------------------------------
Status: Open (was: Patch Available)
> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>
> Key: HDFS-7400
> URL: https://issues.apache.org/jira/browse/HDFS-7400
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: BB2015-05-TBR
> Attachments: HDFS-7400.patch
>
>
> We hit the following scenario on an active NN machine:
> * The disk array controller firmware had a bug, so the disks stopped working.
> * ZKFC and NN still considered the node healthy; communication between ZKFC
> and ZK, as well as between ZKFC and NN, was fine.
> * The machine could be pinged.
> * The machine could not be reached over SSH.
> As a result, no clients or DNs could use the NN, yet ZKFC and NN still
> considered the node healthy.
> The question is how ZKFC and NN can detect such OS/HW-specific issues
> quickly. Some ideas we discussed briefly:
> * Have other machines help decide whether the NN is actually healthy. You
> then have to figure out how to keep that decision accurate in the case of
> network issues, etc.
> * Run an OS/HW health check script external to ZKFC/NN on the same machine.
> If it detects disk or other issues, it can, for example, reboot the machine.
> * Run an OS/HW health check script inside ZKFC/NN. For example, NN's
> HAServiceProtocol#monitorHealth can be modified to call such a health check
> script.
> Thoughts?
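
The third idea in the description, calling a health check script from
monitorHealth, could be sketched roughly as follows. This is an illustration
only and does not reflect the contents of HDFS-7400.patch: the class
ScriptHealthCheck, its constructor arguments, the isHealthy() helper, and the
nested HealthCheckFailedException (a local stand-in named after Hadoop's
exception so the sketch compiles on its own) are all assumptions. The timeout
is included because a process that touches a failed disk can block
indefinitely instead of returning an error. Requires Java 9 or later.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of Hadoop or of HDFS-7400.patch. */
public class ScriptHealthCheck {

    /** Local stand-in for Hadoop's exception of the same name. */
    public static class HealthCheckFailedException extends IOException {
        public HealthCheckFailedException(String msg) {
            super(msg);
        }
    }

    private final List<String> command;  // script path plus arguments
    private final long timeoutMs;        // upper bound on script runtime

    public ScriptHealthCheck(List<String> command, long timeoutMs) {
        this.command = command;
        this.timeoutMs = timeoutMs;
    }

    /**
     * Runs the script once. A monitorHealth() implementation could call
     * this and let the exception propagate to mark the node unhealthy.
     */
    public void check() throws HealthCheckFailedException {
        Process p;
        try {
            p = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    // Discard output so a chatty script cannot fill the pipe.
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
        } catch (IOException e) {
            throw new HealthCheckFailedException(
                    "cannot start health script: " + e.getMessage());
        }
        try {
            // Treat a hung script as a failure rather than waiting forever.
            if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
                p.destroyForcibly();
                throw new HealthCheckFailedException(
                        "health script timed out after " + timeoutMs + " ms");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            p.destroyForcibly();
            throw new HealthCheckFailedException("health check interrupted");
        }
        if (p.exitValue() != 0) {
            throw new HealthCheckFailedException(
                    "health script exited with code " + p.exitValue());
        }
    }

    /** Convenience wrapper: true only if the script exits 0 in time. */
    public boolean isHealthy() {
        try {
            check();
            return true;
        } catch (HealthCheckFailedException e) {
            return false;
        }
    }
}
```

A non-zero exit code, a timeout, and a failure to launch the script are all
reported the same way, so the caller only has to handle one failure type.
Whether a single failed run should trigger failover, or only several in a
row, is a policy choice this sketch leaves open.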