[ https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------
Attachment: HADOOP-3376.1
Made the suggested changes. Also updated documentation.
- When the request itself exceeds the maximum limits, it prints "Request
exceeded maximum user limits. CurrentUsage:%s, Requested:%s, MaxLimit:%s" at
critical log level and deletes the cluster.
- When the request is within limits but cumulative usage crosses them, it
prints "Request exceeded maximum user limits. CurrentUsage:%s, Requested:%s,
MaxLimit:%s. This cluster will remain queued till old clusters free resources"
at info level and the cluster stays in the queued state.
- Replaced the job-feasibility config parameter with job-feasibility-attr: it
specifies whether to check job feasibility against resource manager and/or
scheduler limits, and also gives the attribute value to check. It defaults to
TORQUE_USER_LIMITS_COMMENT_FIELD, which is "User-limits exceeded.
Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)". (A rough sketch of how
this attribute could be used follows this list.)
- Made the necessary changes in checklimits and moved it to the hod/support directory.
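A minimal, illustrative sketch (not the actual patch code) of how the default
TORQUE_USER_LIMITS_COMMENT_FIELD attribute above could drive the two
behaviours described in this list; the check_feasibility() helper and the
job_comment argument are hypothetical names used only for this example:

    import logging
    import re

    logger = logging.getLogger("hod")

    # Default value of job-feasibility-attr, as described above.
    TORQUE_USER_LIMITS_COMMENT_FIELD = (
        "User-limits exceeded. "
        "Requested:([0-9]*) Used:([0-9]*) MaxLimit:([0-9]*)")

    def check_feasibility(job_comment):
        """Parse the job's comment field and decide what HOD should do.

        Returns 'delete' when the request by itself exceeds the maximum
        user limits, 'queue' when the request fits but cumulative usage
        crosses the limit, and None when no limit is reported."""
        match = re.search(TORQUE_USER_LIMITS_COMMENT_FIELD, job_comment)
        if match is None:
            return None
        requested, used, max_limit = (int(g) for g in match.groups())
        msg = ("Request exceeded maximum user limits. CurrentUsage:%s, "
               "Requested:%s, MaxLimit:%s" % (used, requested, max_limit))
        if requested > max_limit:
            # Infeasible no matter what frees up: critical log, delete the cluster.
            logger.critical(msg)
            return 'delete'
        # Feasible once older clusters release resources: info log, stay queued.
        logger.info(msg + ". This cluster will remain queued till old "
                    "clusters free resources")
        return 'queue'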
> [HOD] HOD should have a way to detect and deal with clusters that
> violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3376
> URL: https://issues.apache.org/jira/browse/HADOOP-3376
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hod
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on the jobs
> submitted, any HOD cluster that exceeds/violates these limits may 1) get
> blocked/queued indefinitely or 2) get blocked till resources occupied by old
> clusters are freed. HOD should detect these scenarios and deal with them
> intelligently, instead of just waiting for a long time or forever. This also
> means giving more, and more accurate, information to the submitter.
> (Internal) Use Case:
> If there are no resource limits, users can flood the resource manager
> queue, preventing other users from using it. To avoid this, we could set up
> various types of limits in either the resource manager or a scheduler
> - a max node limit in torque (per job), a maxproc limit in maui (per
> user/class), a maxjob limit in maui (per user/class), etc. But there is one
> problem with the current setup - for example, if we set up a maxproc limit
> in maui to cap the aggregate number of nodes used by any user over all jobs,
> 1) a job gets queued indefinitely if it exceeds the max limit, and 2) it gets
> blocked if it asks for fewer nodes than the max limit but some of the
> resources are already used by jobs from the same user. This issue addresses
> how to deal with scenarios like these.