[
https://issues.apache.org/jira/browse/HADOOP-3376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598949#action_12598949
]
Hemanth Yamijala commented on HADOOP-3376:
------------------------------------------
Some comments:
- I think job-feasibility-attr should be optional. Code that depends on
this attribute needs to check for it, or be changed to handle the case where
it is not defined:
In torque.py.isJobFeasible, if job-feasibility-attr is not defined, we
would get an exception, and the info message printed would not be very
descriptive. I think it would just print 'job-feasibility-attr', with no
information about what the error actually is.
__check_job_state: doesn't handle the case where job-feasibility-attr is not
defined.
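A minimal sketch of what the optional handling could look like (the function
name, dict-based config access, and the 'OVER_LIMIT' marker are illustrative
assumptions, not HOD's actual code):

```python
def is_job_feasible(job_attrs, config):
    # 'job-feasibility-attr' is optional; skip the feasibility check
    # entirely when it is not configured.
    attr = config.get('job-feasibility-attr')
    if attr is None:
        return True
    if attr not in job_attrs:
        # Make the failure descriptive: name the attribute *and* say what
        # went wrong, instead of printing just 'job-feasibility-attr'.
        raise ValueError("job-feasibility-attr '%s' not present in job "
                         "attributes" % attr)
    # 'OVER_LIMIT' is a hypothetical marker value for an infeasible job.
    return job_attrs[attr] != 'OVER_LIMIT'
```

__check_job_state would need the same "attribute not defined" guard before
using the value.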
- The messages now read as follows:
(In case of req. resources > max resources):
Request exceeded maximum user limits. CurentUsage:%s, Requested:%s, MaxLimit:%s
(In the other case):
Request exceeded maximum user limits. CurentUsage:3, Requested:3, MaxLimit:3
This cluster will remain queued till old clusters free resources.
The messages still do not clarify which resource is being exceeded.
I suggest the following:
Requested number of nodes exceeded maximum user limits. Current Usage:%s,
Requested:%s, Maximum Limit:%s. This cluster cannot be allocated now.
and
Requested number of nodes exceeded maximum user limits. Current Usage:%s,
Requested:%s, Maximum Limit:%s. This cluster allocation will succeed only
after other clusters are deallocated.
(Note: I also corrected some typos in the messages.)
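To make the two scenarios concrete, the choice between the suggested messages
could be sketched as follows (the helper name and comparison logic are my
assumptions, not the patch's code):

```python
def limit_message(current, requested, max_limit):
    # Hypothetical helper: picks between the two suggested messages based
    # on whether the request could ever fit within the user's limit.
    base = ("Requested number of nodes exceeded maximum user limits. "
            "Current Usage:%s, Requested:%s, Maximum Limit:%s. "
            % (current, requested, max_limit))
    if requested > max_limit:
        # The request alone exceeds the limit: it can never be satisfied.
        return base + "This cluster cannot be allocated now."
    # The request fits, but current usage + request exceeds the limit, so
    # it can succeed once other clusters release their nodes.
    return (base + "This cluster allocation will succeed only after other "
            "clusters are deallocated.")
```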
- The executable bit is not being turned on for support/checklimits.sh. This is
mostly due to a bug in the ant script: for code under the contrib projects,
only files under the bin/ folder are made executable when packaged. As this is
not a bug in HOD, I think we should leave it as it is, but update the usage
documentation to tell users to make the script executable.
- In checklimits.sh, the sleep at the end is not required.
- In the case where current usage + requested usage exceeds the limit, the
critical message is printed every 10 seconds. It should be printed only once.
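Suppressing the repeat can be as simple as remembering that the message was
already emitted once; a sketch (the class and method names are mine, not HOD's
logging API):

```python
class OneShotCritical:
    """Emit a given critical message at most once per allocation attempt,
    even though the surrounding poll loop wakes up every 10 seconds."""

    def __init__(self):
        self._emitted = set()

    def critical(self, msg):
        if msg in self._emitted:
            return False            # suppressed repeat
        self._emitted.add(msg)
        print(msg)                  # real code would go through HOD's logger
        return True
```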
Other than these, I tested checklimits and hod in both scenarios, and they
work fine.
> [HOD] HOD should have a way to detect and deal with clusters that
> violate/exceed resource manager limits
> --------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3376
> URL: https://issues.apache.org/jira/browse/HADOOP-3376
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hod
> Reporter: Vinod Kumar Vavilapalli
> Assignee: Vinod Kumar Vavilapalli
> Attachments: checklimits.sh, HADOOP-3376, HADOOP-3376.1
>
>
> Currently, if we set up resource manager/scheduler limits on the jobs
> submitted, any HOD cluster that exceeds/violates these limits may 1) get
> blocked/queued indefinitely, or 2) be blocked till resources occupied by old
> clusters get freed. HOD should detect these scenarios and deal with them
> intelligently, instead of just waiting for a long time or forever. This means
> giving more, and more appropriate, information to the submitter.
> (Internal) Use Case:
> If there are no resource limits, users can flood the resource manager
> queue, preventing other users from using it. To avoid this, we could set up
> various types of limits in either the resource manager or a scheduler
> - a max node limit in torque (per-job limit), a maxproc limit in maui (per
> user/class), a maxjob limit in maui (per user/class), etc. But there is one
> problem with the current setup - e.g., if we set up a maxproc limit in maui
> to limit the aggregate number of nodes used by any user over all jobs, 1)
> jobs get queued indefinitely if they exceed the max limit, and 2) they get
> blocked if they ask for nodes < max limit but some of the resources are
> already used by jobs from the same user. This issue addresses how to deal
> with scenarios like these.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.