[ https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659040#action_12659040 ]
Hemanth Yamijala commented on HADOOP-4938:
------------------------------------------
The proposal is to build an external script that detects idle clusters. It can
be run as a cron job, or by other means, to track clusters whose ringmaster
and/or JT has been unreachable for a specified period of time, and to
deallocate such clusters automatically.
We can assume that if a ringmaster is alive, it will take care of deallocation
on its own, so the external script does not need to worry about such clusters.
The script could also be enhanced to send an email to a cluster administrator
if a cluster is still alive after the script has detected it as idle and
attempted to deallocate it. The motivation for this comes from an observation
that sometimes the resource manager fails to deallocate a cluster whose head
node (on which the ringmaster process is launched) is down.
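
The detection step described above could look roughly like the following
Python sketch. It is only an illustration under stated assumptions, not the
actual script: the function names, the cluster-to-endpoint mapping, the use of
a plain TCP connect as the reachability test, and the in-memory state
dictionary are all hypothetical. It probes only the ringmaster (a JT check
could be added in the same way), and a cron-driven version would need to
persist the state between runs.

```python
import socket
import time


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: a successful TCP connect is a good enough liveness test.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def find_clusters_to_deallocate(clusters, first_unreachable, grace_seconds,
                                now=None, probe=is_reachable):
    """Return ids of clusters whose ringmaster has been unreachable too long.

    clusters          -- hypothetical mapping: cluster id -> (host, port) of
                         the ringmaster
    first_unreachable -- mapping: cluster id -> time the ringmaster was first
                         seen unreachable; updated in place on every call
    grace_seconds     -- the "specified period of time" from the proposal
    """
    now = time.time() if now is None else now
    expired = []
    for cluster_id, endpoint in clusters.items():
        if probe(*endpoint):
            # A live ringmaster deallocates its own cluster; forget any
            # earlier failure and skip it.
            first_unreachable.pop(cluster_id, None)
            continue
        # Remember when this ringmaster was first seen down.
        since = first_unreachable.setdefault(cluster_id, now)
        if now - since >= grace_seconds:
            expired.append(cluster_id)
    return expired
```

The caller would then ask the resource manager to deallocate each returned
cluster and, if the cluster is still alive afterwards, send the email
mentioned above.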
> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
> Key: HADOOP-4938
> URL: https://issues.apache.org/jira/browse/HADOOP-4938
> Project: Hadoop Core
> Issue Type: Improvement
> Components: contrib/hod
> Reporter: Hemanth Yamijala
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty
> nodes on which the ringmaster process comes up may go down after the cluster
> is successfully allocated. Such clusters fail to deallocate automatically
> even if the idleness limit of the cluster is exceeded. This is because the
> idleness is tracked by the ringmaster process which itself has gone down.
> As a large number of nodes can get held up due to this, such clusters should
> be detected and deallocated in some manner.