[
https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Vinod Kumar Vavilapalli updated HADOOP-3376:
--------------------------------------------
Status: Open (was: Patch Available)
Cancelling patch to incorporate Hemanth's comments. The following things need
to be done:
- Each time the cluster is checked for feasibility, two qstat invocations are
made - reduce this so the required information is fetched in a single trip to
the resource manager.
- There are two ways in which user limits can be crossed - requesting resources
beyond the max limit, and cumulative usage crossing the max limit. These two
scenarios should be handled separately - in the first case the cluster should
be deallocated, while in the second the cluster should not be deallocated but
the user should be appropriately informed.
- Do away with the configuration variable check-job-feasibility. Instead, have
a variable job-feasibility-comment, which will 1) indicate whether the
user-limits functionality is to be used and 2) specify the comment field set by
checklimits.sh - currently checkjob (used by checklimits.sh) prints "job [0-9]*
violates active HARD MAXPROC limit of [0-9]* for user [a-z]* (R: [0-9]*, U:
[0-9]*])"
- This patch changes the behavior of getJobState. It should return only True
or False on all code paths.
- Modify the error message TORQUE_USER_LIMITS_EXCEEDED_MSG so that it also
prints the max limits, letting the user modify his request accordingly.
- checklimits.sh: 1) Submit this as well within the patch, as part of
src/contrib/hod/support. 2) checklimits.sh should do only one iteration over
all incomplete jobs, modifying the comment field according to whether each job
crosses the user limits. It should be left to some outside mechanism (like
cron) to run checklimits.sh repeatedly at some interval.
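As an illustration, the checkjob comment string quoted above can be parsed to
tell the two limit scenarios apart. This is a rough sketch only - the helper
name is hypothetical, and the exact regex (including the stray bracket in the
quoted pattern) may need adjusting against real checkjob output:

```python
import re

# Pattern derived from the checkjob output quoted above (sketch; real
# output formatting may differ slightly).
LIMIT_RE = re.compile(
    r"job \d+ violates active HARD MAXPROC limit of (\d+) "
    r"for user ([a-z]+) \(R: (\d+), U: (\d+)\)"
)

def classify_violation(comment):
    """Return None if no violation, else a dict describing it.

    R = resources requested by this job, U = resources the user already
    holds. If the request alone exceeds the limit, the cluster should be
    deallocated; if only the cumulative usage crosses the limit, the user
    should merely be informed (hypothetical policy, per the list above).
    """
    m = LIMIT_RE.search(comment)
    if m is None:
        return None
    limit = int(m.group(1))
    requested = int(m.group(3))
    return {
        "user": m.group(2),
        "limit": limit,
        "requested": requested,
        "used": int(m.group(4)),
        "action": "deallocate" if requested > limit else "inform",
    }
```

HOD could run such a check once per allocation attempt, which also fits the
single-qstat-trip goal above: fetch the comment field along with the job state
in one query, then classify locally.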
> [HOD] HOD should have a way to detect and deal with clusters that
> violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3376
> URL: https://issues.apache.org/jira/browse/HADOOP-3376
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hod
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments: checklimits.sh, HADOOP-3376
>
>
> Currently, if we set up resource manager/scheduler limits on submitted jobs,
> any HOD cluster that exceeds/violates these limits may 1) get blocked/queued
> indefinitely, or 2) get blocked until resources occupied by old clusters are
> freed. HOD should detect these scenarios and deal with them intelligently,
> instead of just waiting for a long time or forever. This means giving more,
> and more accurate, information to the submitter.
> (Internal) Use Case:
> If there are no resource limits, users can flood the resource manager
> queue, preventing other users from using it. To avoid this, we could have
> various types of limits set up in either the resource manager or a scheduler
> - a max node limit in Torque (per-job limit), a maxproc limit in Maui (per
> user/class), a maxjob limit in Maui (per user/class), etc. But there is one
> problem with the current setup - for example, if we set up a maxproc limit in
> Maui to limit the aggregate number of nodes used by any user over all jobs,
> then 1) jobs get queued indefinitely if they exceed the max limit, and 2)
> jobs get blocked if they ask for fewer nodes than the max limit but some of
> the resources are already used by jobs from the same user. This issue
> addresses how to deal with scenarios like these.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.