[jira] Commented: (HADOOP-4938) [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down

Hemanth Yamijala (JIRA) Tue, 23 Dec 2008 20:53:07 -0800

    [ https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12659040#action_12659040 ]


Hemanth Yamijala commented on HADOOP-4938:
------------------------------------------

The proposal is to build an external script that detects idle clusters. It can 
be run as a cron job, or by other means, to track clusters whose ringmaster 
and/or JobTracker (JT) has been unreachable for a specified period of time, and 
to deallocate such clusters automatically.

We can assume that if a ringmaster is alive, it will take care of deallocation 
on its own, so the external script does not need to worry about such clusters.
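
As a rough illustration of the detection step, here is a minimal Python sketch. 
It is not part of HOD: the function names, the cluster-to-endpoint mapping, and 
the use of a plain TCP connect as the reachability test are all assumptions 
made for the example. It also probes only one endpoint per cluster (say, the 
ringmaster), whereas the real script might check the JT as well.

```python
import socket


def is_reachable(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: a successful connect is treated as "ringmaster alive".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def clusters_to_deallocate(clusters, first_unreachable, now, limit_secs,
                           probe=is_reachable):
    """Return ids of clusters unreachable for at least limit_secs.

    clusters          -- hypothetical mapping: cluster id -> (host, port)
    first_unreachable -- mapping: cluster id -> time it was first seen down;
                         updated in place, so a cron-driven script would have
                         to persist it between runs (e.g. in a small file)
    now               -- current time in seconds
    probe             -- reachability test, injectable for testing
    """
    doomed = []
    for cluster_id, (host, port) in clusters.items():
        if probe(host, port):
            # A live ringmaster handles its own idleness; forget the cluster.
            first_unreachable.pop(cluster_id, None)
            continue
        since = first_unreachable.setdefault(cluster_id, now)
        if now - since >= limit_secs:
            doomed.append(cluster_id)
    return doomed
```

Each run would call clusters_to_deallocate with the current time and hand the 
returned ids to whatever mechanism actually releases the nodes.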

The script could also be enhanced to send an email to a cluster administrator 
if a cluster is still alive after the script has detected it as idle and 
attempted to deallocate it. The motivation for this comes from an observation 
that the resource manager sometimes fails to deallocate a cluster whose head 
node (on which the ringmaster process is launched) is down.
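
The notification step might look roughly like the following Python sketch. 
Everything here is hypothetical: the deallocate and still_allocated callables 
stand in for whatever the resource manager offers for releasing and querying a 
cluster, and the sender address and the use of a local SMTP server are 
assumptions made for the example.

```python
import smtplib
from email.message import EmailMessage


def build_alert(cluster_id, admin_addr, sender="hod-cleanup@localhost"):
    """Build the mail sent when a cluster survives a deallocation attempt."""
    msg = EmailMessage()
    msg["Subject"] = ("HOD cluster %s still allocated after cleanup attempt"
                      % cluster_id)
    msg["From"] = sender
    msg["To"] = admin_addr
    msg.set_content(
        "The idle-cluster cleanup script tried to deallocate cluster %s, "
        "but the cluster still appears to be allocated. Manual cleanup "
        "may be required.\n" % cluster_id)
    return msg


def send_via_local_smtp(msg):
    """Send msg through an SMTP server on localhost (an assumption)."""
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)


def deallocate_and_verify(cluster_id, deallocate, still_allocated,
                          admin_addr, send=send_via_local_smtp):
    """Try to deallocate a cluster; mail the administrator if it fails.

    deallocate      -- hypothetical callable that asks the resource manager
                       to release the cluster
    still_allocated -- hypothetical callable returning True if the cluster
                       is still held after the attempt
    Returns True if the cluster is gone, False if an alert was sent.
    """
    deallocate(cluster_id)
    if still_allocated(cluster_id):
        send(build_alert(cluster_id, admin_addr))
        return False
    return True
```

Passing the send function as a parameter keeps the check testable without a 
mail server and leaves room for a different alerting channel.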



> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4938
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4938
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty 
> nodes on which the ringmaster process comes up may go down after the cluster 
> is successfully allocated. Such clusters fail to deallocate automatically 
> even if the idleness limit of the cluster is exceeded. This is because the 
> idleness is tracked by the ringmaster process which itself has gone down.
> As a large number of nodes can get held up due to this, such clusters should 
> be detected and deallocated in some manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
