[
https://issues.apache.org/jira/browse/HDFS-7400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ming Ma updated HDFS-7400:
--------------------------
Attachment: HDFS-7400.patch
Thanks, [[email protected]], [~cmccabe] and [~aw] for the additional
suggestions and probes.
So far we believe it is a disk controller related issue. We still need to gather
more data, repro steps, etc.
We have used a health check script on worker nodes to detect various issues. There
is a good chance this script can detect such a scenario, so it would be useful to
support NN health check script functionality. At the least, it allows us to test
things out. So here is the initial patch.
1. The health check script output uses the same format as YARN so that it is
easier to develop and maintain.
2. During the ZKFC -> NN health check RPC call, the health check script is
invoked if one is defined.
3. This is specific to NN. https://issues.apache.org/jira/browse/HDFS-7441
discusses supporting health checks for DNs.
4. The code could have been shared between YARN and HDFS. We can move the health
check related code into hadoop-common if people prefer.
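To make points 1 and 2 concrete, here is a minimal sketch of what the check could look like. It is not taken from HDFS-7400.patch. It assumes the YARN node health script convention that a stdout line beginning with ERROR marks the node unhealthy. The class name, method names, and the decision to treat a timeout as a failure are hypothetical, and a plain IOException stands in for whatever exception monitorHealth would actually throw.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for illustration; names are not from the patch. */
class NameNodeHealthScript {
  /** Assumed YARN-style convention: an output line starting with this is a failure. */
  static final String ERROR_PREFIX = "ERROR";

  /** Returns the first output line that starts with ERROR, or null if there is none. */
  static String firstErrorLine(String scriptOutput) {
    for (String line : scriptOutput.split("\\r?\\n")) {
      if (line.startsWith(ERROR_PREFIX)) {
        return line;
      }
    }
    return null;
  }

  /**
   * Runs the script and throws if it reports ERROR or does not finish in time.
   * Treating a timeout as unhealthy is an assumption of this sketch.
   */
  static void check(String scriptPath, long timeoutMs)
      throws IOException, InterruptedException {
    // Capture output in a temp file so a hung script cannot block us on a pipe read.
    File out = File.createTempFile("nn-health", ".out");
    try {
      Process p = new ProcessBuilder(scriptPath)
          .redirectErrorStream(true)
          .redirectOutput(out)
          .start();
      if (!p.waitFor(timeoutMs, TimeUnit.MILLISECONDS)) {
        p.destroyForcibly();
        throw new IOException("health script timed out after " + timeoutMs + " ms");
      }
      String output =
          new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
      String error = firstErrorLine(output);
      if (error != null) {
        throw new IOException("health script reported: " + error);
      }
    } finally {
      out.delete();
    }
  }
}
```

In this sketch the NN would call check(...) from its monitorHealth handling when a script path is configured. Since the failure being targeted is a wedged disk, even creating the temp file could block, so a real implementation would probably need to run the check off the RPC handler thread.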
Appreciate any input.
> More reliable namenode health check to detect OS/HW issues
> ----------------------------------------------------------
>
> Key: HDFS-7400
> URL: https://issues.apache.org/jira/browse/HDFS-7400
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Ming Ma
> Attachments: HDFS-7400.patch
>
>
> We had this scenario on an active NN machine.
> * The disk array controller firmware has a bug, so the disks stop working.
> * ZKFC and NN still considered the node healthy; Communications between ZKFC
> and ZK as well as ZKFC and NN are good.
> * The machine can be pinged.
> * The machine can't be reached over ssh.
> So all clients and DNs can't use the NN. But ZKFC and NN still consider the
> node healthy.
> The question is how we can have ZKFC and NN detect such OS/HW specific issues
> quickly. Some ideas we discussed briefly:
> * Have other machines help to make the decision whether the NN is actually
> healthy. Then you have to figure out how to make the decision accurate in the
> case of network issues, etc.
> * Run OS/HW health check script external to ZKFC/NN on the same machine. If
> it detects disk or other issues, it can reboot the machine for example.
> * Run OS/HW health check script inside ZKFC/NN. For example NN's
> HAServiceProtocol#monitorHealth can be modified to call such health check
> script.
> Thoughts?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)