[ 
https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------

    Attachment: HADOOP-3376

Attaching a patch.

 - This implements the changes required in HOD to better handle clusters that 
exceed resource manager or scheduler limits.
 - With this patch, every time HOD detects that the cluster is still queued, it 
calls the isJobFeasible method of the resource manager interface 
(src/contrib/hod/hodlib/Hod/nodePool.py) to check whether the job can run at 
all.
 - The Torque implementation of isJobFeasible 
(src/contrib/hod/hodlib/NodePools/torque.py) uses the comment field in the 
qstat output. When this comment field equals 
hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD, HOD deallocates the 
cluster with the error message "Request exceeded maximum user limits. Cluster 
will not be allocated." As it stands, this is only part of the solution: the 
Torque comment field has to be set to the above string, either by a scheduler 
or by an external tool.
 - This also introduces a HOD config parameter, check-job-feasibility, which 
enables the above check. It defaults to false and specifies whether or not to 
check job feasibility against resource manager and/or scheduler limits.
 - This patch also replaces a few 'job' strings with the string 'cluster'.
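
For reviewers who have not opened the patch yet, the check described above can 
be sketched roughly as follows. This is only an illustration and not the code 
in the patch: the function names, the placeholder marker string and the 
assumption that the text being parsed is 'qstat -f' style output with one 
"name = value" attribute per line are all mine. The real marker is the value of 
hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD, which is not reproduced 
here.

```python
import re

# Hypothetical placeholder. The real marker string is defined in
# hodlib.Common.util.TORQUE_USER_LIMITS_COMMENT_FIELD and is not shown here.
USER_LIMITS_COMMENT = "<user-limits-marker>"


def get_comment(qstat_output):
    """Return the 'comment' attribute from qstat -f style text, or None.

    Assumes one 'name = value' attribute per line; wrapped continuation
    lines are not handled in this sketch.
    """
    for line in qstat_output.splitlines():
        match = re.match(r"\s*comment\s*=\s*(.*)$", line)
        if match:
            return match.group(1).strip()
    return None


def is_job_feasible(qstat_output, check_enabled=False):
    """Return False only when checking is enabled and the comment field
    equals the user-limits marker; otherwise treat the job as feasible."""
    if not check_enabled:
        # Mirrors check-job-feasibility defaulting to false.
        return True
    return get_comment(qstat_output) != USER_LIMITS_COMMENT
```

In this sketch, a False result is the point at which HOD would deallocate the 
cluster and report the error message quoted above.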

> [HOD] HOD should have a way to detect and deal with clusters that 
> violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-3376
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3376
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hod
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs, 
> any HOD cluster that exceeds/violates these limits may 1) get 
> blocked/queued indefinitely or 2) get blocked until resources occupied by old 
> clusters are freed. HOD should detect these scenarios and deal with them 
> intelligently, instead of just waiting for a long time or forever. This means 
> giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
>      If there are no resource limits, users can flood the resource manager 
> queue, preventing other users from using the queue. To avoid this, we could 
> have various types of limits set up in either the resource manager or a 
> scheduler - max node limit in torque (per job limit), maxproc limit in maui 
> (per user/class), maxjob limit in maui (per user/class), etc. But there is 
> one problem with the current setup - for example, if we set up a maxproc 
> limit in maui to limit the aggregate number of nodes used by any user over 
> all jobs, 1) jobs get queued indefinitely if they exceed the max limit and 
> 2) a job gets blocked if it asks for nodes < max limit, but some of the 
> resources are already used by jobs from the same user. This issue addresses 
> how to deal with scenarios like these.
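
The two scenarios in the use case above can be told apart with simple 
arithmetic. The sketch below is only an illustration under assumed names 
(classify_request, requested, in_use, max_limit are hypothetical and do not 
come from HOD, torque or maui). It treats a request equal to the limit as 
within the limit.

```python
def classify_request(requested, in_use, max_limit):
    """Classify a node request against a per-user aggregate limit.

    requested: nodes asked for by the new cluster
    in_use:    nodes already held by the same user's other jobs
    max_limit: aggregate per-user limit (for example, a maxproc-style limit)
    """
    if requested > max_limit:
        # Scenario 1: can never run, so it would stay queued indefinitely.
        return "infeasible"
    if requested + in_use > max_limit:
        # Scenario 2: can run only after the user's older jobs free resources.
        return "blocked"
    return "runnable"


# Example with a limit of 10 nodes per user:
print(classify_request(12, 0, 10))
print(classify_request(6, 5, 10))
print(classify_request(4, 5, 10))
```

Only the first case is one where waiting can never help, which is why it is 
the one that calls for an immediate error to the submitter.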

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
