[HOD] HOD should have a way to detect and deal with clusters that
violate/exceed resource manager limits
--------------------------------------------------------------------------------------------------------
Key: HADOOP-3376
URL: https://issues.apache.org/jira/browse/HADOOP-3376
Project: Hadoop Core
Issue Type: Bug
Components: contrib/hod
Reporter: Vinod Kumar Vavilapalli
Assignee: Hemanth Yamijala
Currently, if we set up resource manager/scheduler limits on submitted jobs,
any HOD cluster that exceeds/violates these limits may either 1) get queued
indefinitely or 2) stay blocked until resources occupied by old clusters are
freed. HOD should detect these scenarios and handle them intelligently,
instead of waiting for a long time or forever. This also means reporting
proper information back to the submitter.
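One way to avoid waiting forever is a deadline on cluster allocation: poll the job state and give up (with a message to the submitter) if it never leaves the queued state. A minimal sketch, assuming a Torque-style state code ('Q' = queued); the function name and timeout value are illustrative, not part of HOD:

```python
import time

# Illustrative default: how long HOD might wait before concluding the
# allocation is stuck behind a resource manager/scheduler limit.
ALLOCATION_TIMEOUT = 300  # seconds

def wait_for_allocation(get_state, timeout=ALLOCATION_TIMEOUT, poll_interval=5):
    """Poll get_state() until the job leaves the queued ('Q') state.

    Returns True if the job started within `timeout` seconds, False if
    it stayed queued, which likely means a limit was hit and the user
    should be told to ask for fewer nodes or try again later.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if get_state() != 'Q':  # e.g. 'R' = running in Torque's qstat
            return True
        time.sleep(poll_interval)
    return False
```

In HOD, `get_state` would wrap a query to the resource manager (e.g. parsing qstat output); on False, HOD could deallocate the stuck job and print the reason instead of blocking.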
(Internal) Use Case:
If there are no resource limits, users can flood the resource manager
queue, preventing other users from using it. To avoid this, we could set up
various types of limits in either the resource manager or the scheduler - a
max node limit in Torque (per-job limit), a maxproc limit in Maui (per
user/class), a maxjob limit in Maui (per user/class), etc. But there is one
problem with the current setup: for example, if we set up a maxproc limit in
Maui to cap the aggregate number of nodes used by any user across all jobs,
1) jobs get queued indefinitely if they exceed the max limit, and 2) a job is
blocked if it asks for fewer nodes than the max limit but some of the
resources are already in use by other jobs from the same user.
This issue addresses how to deal with scenarios like these.
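For concreteness, the limits mentioned above use real Maui and Torque knobs; the values below are illustrative, not a recommended configuration:

# maui.cfg - per-user aggregate limits (values illustrative)
# MAXPROC caps processors in use across all of a user's jobs;
# MAXJOB caps the number of simultaneously active jobs.
USERCFG[DEFAULT]  MAXPROC=64  MAXJOB=4

# Torque - per-job node cap on a queue, set via qmgr:
qmgr -c "set queue batch resources_max.nodes = 32"

With limits like these in place, a HOD allocation that requests more than the cap, or that arrives while the user's other clusters hold the quota, hits exactly the queued/blocked scenarios this issue describes.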
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.