[jira] Commented: (HADOOP-4938) [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down

Peeyush Bishnoi (JIRA) Mon, 12 Jan 2009 09:50:23 -0800

    [ 
https://issues.apache.org/jira/browse/HADOOP-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663026#action_12663026
 ]


Peeyush Bishnoi commented on HADOOP-4938:
-----------------------------------------

The approach to building such a script is to identify the HOD-allocated 
clusters for which:

1. Ringmaster is down: Use the "qstat -f <jobid>" output, take the first node 
from the "exec_host" attribute reported by the Torque resource manager, and 
poll that node to see whether it is UP or DOWN.

2. "Resource Manager notes" field is not available: Use the "qstat -f <jobid>" 
output and check whether the "notes" attribute is present.
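
The two checks above could be sketched in Python roughly as follows. This is 
only an illustration: it assumes each attribute in the "qstat -f" output is an 
indented "name = value" line, that long values wrap onto continuation lines 
starting with a tab, and that "exec_host" entries look like "node/cpu" joined 
by "+". The function names are hypothetical and not part of HOD.

```python
def parse_qstat_full(text):
    """Turn `qstat -f <jobid>` output into a dict of attribute -> value."""
    attrs = {}
    key = None
    for line in text.splitlines():
        if line.startswith("\t") and key is not None:
            # Assumed: a tab-indented line continues the previous value.
            attrs[key] += line.strip()
            continue
        name, sep, value = line.lstrip().partition(" = ")
        if sep:
            key = name.strip()
            attrs[key] = value.strip()
        else:
            # Header or blank line: no attribute to continue.
            key = None
    return attrs


def first_exec_host(attrs):
    """Return the first node named in exec_host, or None if absent."""
    exec_host = attrs.get("exec_host")
    if not exec_host:
        return None
    # Assumed entry format: "node/cpu+node/cpu+..."
    return exec_host.split("+")[0].split("/")[0]


def has_notes(attrs):
    """True if the job carries a "notes" attribute at all."""
    return "notes" in attrs
```

The node returned by first_exec_host() is the one to poll for UP or DOWN; how 
that poll is done is left open here.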

Clusters that satisfy both of the above conditions are considered problematic. 
These problematic clusters need to be found, and the corresponding resource 
manager job should either be deleted or, if it has not been deleted, reported 
by mail to the administrator for deletion.

Steps 1 and 2 should be carried out for all running jobs, i.e. the jobs 
listed by "qstat -r".
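
As a rough Python sketch of the overall scan, assuming the job ids from 
"qstat -r" and the parsed "qstat -f" attributes are already available: the 
function name and its two callables (get_attrs, node_is_up) are hypothetical 
placeholders, since how the node is polled is not specified here.

```python
def find_problematic_jobs(job_ids, get_attrs, node_is_up):
    """Return the job ids whose ringmaster node is down AND which lack notes.

    job_ids    -- iterable of running job ids (e.g. taken from "qstat -r")
    get_attrs  -- hypothetical callable: job id -> dict of "qstat -f" attributes
    node_is_up -- hypothetical callable: host name -> True if the node is UP
    """
    problematic = []
    for job_id in job_ids:
        attrs = get_attrs(job_id)
        # Assumed exec_host format: "node/cpu+node/cpu+..."; first entry is
        # taken as the ringmaster node.
        first = attrs.get("exec_host", "").split("+")[0].split("/")[0]
        ringmaster_down = bool(first) and not node_is_up(first)
        notes_missing = "notes" not in attrs
        if ringmaster_down and notes_missing:
            problematic.append(job_id)
    return problematic
```

Each job id returned could then be deleted through the resource manager (for 
example with "qdel") or reported to the administrator by mail.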

---

> [HOD] Cleanup idle HOD clusters whose ringmaster nodes might have gone down
> ---------------------------------------------------------------------------
>
>                 Key: HADOOP-4938
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4938
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/hod
>            Reporter: Hemanth Yamijala
>            Assignee: Peeyush Bishnoi
>
> As mentioned in HADOOP-4937, sometimes in large cluster deployments, faulty 
> nodes on which the ringmaster process comes up may go down after the cluster 
> is successfully allocated. Such clusters fail to deallocate automatically 
> even if the idleness limit of the cluster is exceeded. This is because the 
> idleness is tracked by the ringmaster process which itself has gone down.
> As a large number of nodes can get held up due to this, such clusters should 
> be detected and deallocated in some manner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
