[ 
https://issues.apache.org/jira/browse/HADOOP-3421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607142#action_12607142
 ] 

Vivek Ratan commented on HADOOP-3421:
-------------------------------------


>> 1) I would be interested to know why preempting a running job has been 
>> excluded from the design. I can imagine use cases (say in a production 
>> environment) where a high-priority job should be executed basically 
>> instantaneously at the expense of running jobs at lower priority.

You're right - there are legitimate use cases where you want to preempt running 
jobs for others. But we need to be clear on the exact semantics for preemption, 
especially since we're doing task level scheduling. What Req 3.3 implies is 
that if a job with a higher priority comes in while a job with a lower priority 
is running (i.e., some of its tasks have run, or are running), then going 
forward,  the runnable tasks of the higher priority job will be scheduled 
before the runnable tasks of the lower priority job. What we will not do is 
kill the running tasks of the lower priority job (except in the case of Req 
1.6). This becomes an issue only if the tasks of the lower priority job are 
long running. Essentially, what we're trying to limit in V1 is the killing of 
running tasks. Otherwise, you do get preemption in the sense that tasks of the 
higher priority jobs run earlier, or to put it more generically, the higher 
priority job has higher/earlier access to a queue's resources than the lower 
priority job. The flip side here is that you will may end up with a large 
number of running jobs (which are jobs with at least one task running or having 
run). This can cause a strain in resources (running jobs use up temp disk space 
and have a higher memory footprint in today's JT). 

>> 2) Will the job scheduler at least be able to kill individual tasks of 
>> running jobs to get resources back?

Only to satisfy Req 1.6 for now. It's possible that we kill tasks in more 
situations, in later versions. 

>> 3) Is 'N minutes' in 1.6 configurable per org?

Right now, the thinking is to keep things simple (just so we can get something 
out soon), which means that we're leaning towards global configuration values 
rather than per Org, in many cases. However, it'd be cool if folks can submit 
patches to add more finer-grained options. Clearly, having a per-Org value of N 
makes sense. See HADOOP-3479 for more details on the implementation effort for 
configuration. 

>> 4) 8.1: shouldn't the RM be able to allow 20k+ nodes?

The 3K value is for V1. It ties in to the scale we can support with Hadoop in 
general. Clearly, as HDFS and MR, and Hadoop as a whole, start scaling to more 
and more nodes, the Resource Manager will have to keep pace. It doesn't make 
sense to actually implement V1 to handle 20K+ nodes when other Hadoop 
components do not scale that much. But yes, the RM architecture should be able 
to handle numbers like 10K or 20K. Maybe that req can be rephrased to say that 
the 3K number is for now, and we expect it to grow to 10 K or 20K soon? 

> Requirements for a Resource Manager for Hadoop
> ----------------------------------------------
>
>                 Key: HADOOP-3421
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3421
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Vivek Ratan
>
> This is a proposal to extend the scheduling functionality of Hadoop to allow 
> sharing of large clusters without the use of HOD.  We're suffering from 
> performance issues with HOD and not finding it the right model for running 
> jobs. We have concluded that a native Hadoop Resource Manager would be more 
> useful to many people if it supported the features we need for sharing 
> clusters across large groups and organizations.
> Below are the key requirements for a Resource Manager for Hadoop. First, some 
> terminology used in this writeup: 
> * *RM*: Resource Manager. What we're building.
> * *MR*: Map Reduce.
> * A *job* is an MR job for now, but can be any request. Jobs are submitted by 
> users to the Grid. MR jobs are made up of units of computation called *tasks*.
> * A grid has a variety of *resources* of different *capacities* that are 
> allocated to tasks. For the the early version of the grid, the only resource 
> considered is a Map or Reduce slot, which can execute a task. Each slot can 
> run one or more tasks. Later versions may look at resources such as local 
> temporary storage or CPUs. 
> * *V1*: version 1. Some features are simplified for V1. 
> h3. Orgs, queues, users, jobs
> Organizations (*Orgs*) are distinct entities for administration, 
> configuration, billing and reporting purposes. *Users* belong to Orgs. Orgs 
> have *queues* of jobs, where a queue represents a collection of jobs that 
> share some scheduling criteria. 
>    * *1.1.* For V1, each queue will belong to one Org and each Org will have 
> one queue. 
>    * *1.2.* Jobs are submitted to queues. A single job can be submitted to 
> only one queue. It follows that a job will have a user and an Org associated 
> with it. 
>    * *1.3.* A user can belong to multiple Orgs and can potentially submit 
> jobs to multiple queues. 
>    * *1.4.* Orgs are guaranteed a fraction of the capacity of the grid (their 
> 'guaranteed capacity') in the sense that a certain capacity of resources will 
> be at their disposal. All jobs submitted to the queues of an Org will have 
> access to the capacity guaranteed to the Org. 
>       ** Note: it is expected that the sum of the guaranteed capacity of each 
> Org should equal the resources in the Grid. If the sum is lower, some 
> resources will not be used. If the sum is higher, the RM cannot maintain 
> guarantees for all Orgs. 
>    * *1.5.* At any given time, free resources can be allocated to any Org 
> beyond their guaranteed capacity. For example this may be in the proportion 
> of guaranteed capacities of various Orgs or some other way. However, these 
> excess allocated resources can be reclaimed and made available to another Org 
>  in order to meet its capacity guarantee.
>    * *1.6.* N minutes after an org reclaims resources, it should have all its 
> reserved capacity available. Put another way, the system will guarantee that 
> excess resources taken from an Org will be restored to it within N minutes of 
> its need for them.
>    * *1.7.* Queues have access control. Queues can specify which users are 
> (not) allowed to submit jobs to it. A user's job submission will be rejected 
> if the user does not have access rights to the queue. 
> h3. Job capacity
>    * *2.1.* Users will just submit jobs to the Grid. They do not need to 
> specify the capacity required for their jobs (i.e. how many parallel tasks 
> the job needs). [Most MR jobs are elastic and do not require a fixed number 
> of parallel tasks to run - they can run with as little or as much task 
> parallelism as they can get. This amount of task parallelism is usually 
> limited by the number of mappers required (which is computed by the system 
> and not by the user) or the amount of free resources available in the grid. 
> In most cases, the user wants to just submit a job and let the system take 
> care of utilizing as many or as little resources as it can.]
> h3. Priorities
>    * *3.1.* Jobs can optionally have priorities associated with them. For V1, 
> we support the same set of priorities available to MR jobs today. 
>    * *3.2.* Queues can optionally support priorities for jobs. By default, a 
> queue does not support priorities, in which case it will ignore (with a 
> warning) any priority levels specified by jobs submitted to it. If a queue 
> does support priorities, it will have a default priority associated with it, 
> which is assigned to jobs that don't have priorities. Reqs 3.1 and 3.2 
> together mean this: if a queue supports priorities, then a job is assigned 
> the default priority if it doesn't have one specified, else the job's 
> specified priority is used. If a queue does not support priorities, then it 
> ignores priorities specified for jobs. 
>    * *3.3.* Within a queue, jobs with higher priority will have access to the 
> queue's resources before jobs with lower priority. However, once a job is 
> running, it will not be preempted (i.e., stopped and restarted) for a higher 
> priority job. What this also means is that comparison of priorities makes 
> sense within queues, and not across them. 
> h3. Fairness/limits
>    * *4.1.* In order to prevent one or more users from monopolizing its 
> resources, each queue enforces a limit on the percentage of resources 
> allocated to a user at any given time, if there is competition for them. This 
> user limit can vary between a minimum and maximum value. For V1, all users 
> have the same limit, whose maximum value is dictated by the number of users 
> who have submitted jobs, and whose minimum value is a pre-configured value 
> UL-MIN. For example, suppose UL-MIN is 25. If two users have submitted jobs 
> to a queue, no single user can use more than 50% of the queue resources. If a 
> third user submits a job, no single user can use more than 33% of the queue 
> resources. With 4 or more users, no user can use more than 25% of the queue's 
> resources. 
>       ** Limits apply to newer scheduling, i.e., running jobs or tasks will 
> not be preempted. 
>       ** The value of UL-MIN can be set differently per Org.   
> h3. Job queue interaction
>    * *5.1.* Interaction with the Job queue should be through a command line 
> interface and a web-based GUI. 
>    * *5.2.* All queues are visible to all users. The Web UI will provide a 
> single-page view of all queues. 
>    * *5.3.* Users should be able to delete their queued jobs at any time. 
>    * *5.4.* Users should be able to see capacity statistics for various Orgs 
> (what is the capacity allocated, how much is being used, etc.)
>    * *5.5.* Existing utilities, such as the *hadoop job -list* command, 
> should be enhanced to show additional attributes that are relevant. For e.g. 
> which queue is associated with the job.
> h3. Accounting
>    * *6.1.* The RM must provide accounting information in a manner that can 
> be easily consumed by external plug-ins or utilities to integrate with 3rd 
> party accounting systems. The accounting information should comprise of the 
> following information: 
>       ** Username running the Hadoop job, 
>       ** job id, 
>       ** job name, 
>       ** queue to which job was submitted and organization owning the queue, 
>       ** number of resource units (for e.g. slots) used 
>       ** number of maps / reduces, 
>       ** timings - time of entry into the queue, start and end times of the 
> job, perhaps total node hours, 
>       ** status of the job - success, failed, killed, etc.
>    * *6.2.* To assist deployments which do not require accounting, it should 
> be possible to turn off this feature.
> h3. Availability
>    * *7.1* Job state needs to be persisted (RM restarts should not cause jobs 
> to die)
> h3. Scalability
>    * *8.1.* Scale to 3k+ nodes
>    * *8.2.* Scale to handle 1k+ large submitted jobs, each with 100k+ tasks
> h3. Configuration
>    * *9.1.* The system must provide a mechanism to create and delete 
> organizations, and queues within the organizations. It must also provide a 
> mechanism to configure various properties of these objects. 
>    * *9.2.* Only users with administrative privileges can perform operations 
> of managing and configuring these objects in the system.
>    * *9.3.* Configuration changes must be effective in the RM without 
> requiring its restart. They must be effective in a reasonable amount of time 
> since the modification is made.
>    * *9.4.* For most of the configurations, it appears that there can be 
> values at various levels - Grid, organization, queue, user and job. For e.g. 
> there can be a default value for the resource quota per user at a Grid level, 
> which can be overridden at an org level, and so on. There must be an easy way 
> to express these configurations in this hierarchical fashion. Also, values at 
> a broader level can be overridden by values at a more narrow level.
>    * *9.5.* There must be appropriate default objects and default values for 
> their configuration. This is to help deployments that do not need a complex 
> scheduling setup.
> h3. Logging Enhancements
>    * *10.1.* For purposes of debugging, the Hadoop web UI should provide a 
> facility to see details of all jobs. While this is mostly supported today, 
> any changes to meet other requirements, such as scalability, must not affect 
> this feature. Also, it must be possible to view task logs from Job history UI 
> (see HADOOP:2165)
>    * *10.2.* The system must log all relevant events about a job vis-a-vis 
> scheduling. Particularly, changes in state of a job (queued -> scheduled -> 
> completed | killed), and events which caused these changes must be logged.
>    * *10.3.* The system should be able to provide relevant, explanatory 
> information about the status of job to give feedback to users. This could be 
> a diagnostic string such as why the job is queued or why it failed. (For e.g. 
> lack of sufficient resources - how many were asked, how many are available, 
> exceeding user limits, etc). This information must be available to users, as 
> well as in the logs for debugging purposes. It should also be possible to 
> programmatically get this information.
>    * *10.4.* The host which submitted the job should be part of log messages. 
> This would assist in debugging.
> h3. Security Enhancements
>    * *11.1.* The RM should provide a mechanism for controlling who can submit 
> jobs to which queue. This could be done using an ACL mechanism that consists 
> of an ordered whitelist and blacklist of users. The order can determine which 
> ACL would apply in case of conflicts.
>    * *11.2.* The system must provide a mechanism to list users who have 
> administrative control. Only users in this list should be allowed to modify 
> configuration related to the RM, like configuration, setting up objects, etc.
>    * *11.3.* The system should be able to schedule tasks running on behalf of 
> multiple users concurrently on the same host in a secure manner. 
> Specifically, this should not require any insecure configuration, such as 
> requiring 0777 permissions on directories etc.
>    * *11.4.* The system must follow the security mechanisms being implemented 
> for Hadoop (HADOOP:1701 and friends).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to