[
https://issues.apache.org/jira/browse/HADOOP-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12558734#action_12558734
]
Sanjay Radia commented on HADOOP-2491:
--------------------------------------
HADOOP-2510's analysis that a job should not get a private job-cluster but
merely an ability to run tasks on nodes that have free capacity makes sense.
The reservation of a job-cluster is one of the key causes of the low
utilization in the current system.
BTW, as a few others have noted the scheduling function belongs as a separate
layer rather than being part of MR. Hence I am commenting on this Jira instead
of in 2510.
While I see the simplicity of having one scheduler, I think we may not quite
get away with that. I believe we will need two schedulers: a job scheduler and
a task scheduler.
The job scheduler's role is to move a submitted job into the run-queue of the
grid when the grid has sufficient resources to be able to complete the job
satisfactorily. Once in the run queue, the job generates tasks which are
scheduled by the task scheduler. Without a job scheduler, too many jobs may
fight for running tasks and all of them progress too slowly. E.g. consider two
very large jobs that don't fit in the grid when run together. We need a way for
the system to automatically schedule these 2 very large jobs at different
times. E.g. Large-job1 can run with a bunch of smaller jobs and later
Large-job2 can run with other small jobs. At the very least the job scheduler
admits jobs into the run queue only if the number of tasks in the task queue is
small.
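A minimal sketch of the admission rule described above, not a proposed
implementation. The class name, the method names and the single
max_queued_tasks threshold are all assumptions made for illustration; jobs
that do not fit stay pending while smaller jobs behind them are let through.

```python
class JobScheduler:
    """Hypothetical job scheduler: admits a job into the run queue only
    while the task queue backlog stays under an assumed threshold."""

    def __init__(self, max_queued_tasks):
        self.max_queued_tasks = max_queued_tasks  # assumed tuning knob
        self.pending = []       # submitted jobs not yet admitted, oldest first
        self.queued_tasks = 0   # tasks currently counted against the grid

    def submit(self, job_id, num_tasks):
        self.pending.append((job_id, num_tasks))

    def admit(self):
        """Admit every pending job that still fits, in submission order.
        Returns the ids of the jobs moved into the run queue."""
        admitted, still_pending = [], []
        for job_id, num_tasks in self.pending:
            if self.queued_tasks + num_tasks <= self.max_queued_tasks:
                self.queued_tasks += num_tasks
                admitted.append(job_id)
            else:
                still_pending.append((job_id, num_tasks))
        self.pending = still_pending
        return admitted

    def tasks_finished(self, count):
        """Called as tasks complete, freeing room for later admissions."""
        self.queued_tasks = max(0, self.queued_tasks - count)
```

With a threshold of 100 tasks, two 80-task jobs submitted together are not
admitted together: the first runs alongside whatever small jobs fit, and the
second is admitted on a later call to admit() once enough tasks have finished.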
BTW I suspect that for map-reduce jobs, we may be able to get away with a very
simplistic job-scheduler that uses priorities and takes advantage of the fact
that all the mappers are created initially and the reducers follow the
mappers. Hence if the task queues are priority based and FCFS, and furthermore
reduce tasks are given a higher priority, then things may work with a simple job
scheduler. But more complex jobs may need a sophisticated job scheduler. My
main point is that the abstraction of a job scheduler is needed.
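One way to picture such a task queue is sketched below. It is only an
illustration: the TaskQueue class, the numeric priority convention (lower
value is served first) and the choice to order by job priority, then task
kind, then arrival are assumptions, since the ordering is not spelled out
above.

```python
import heapq
import itertools

# Assumed convention: lower values are served first, so reduces beat maps.
REDUCE, MAP = 0, 1


class TaskQueue:
    """Hypothetical priority + FCFS task queue with reduces ahead of maps."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker that gives FCFS order

    def add(self, job_priority, kind, task_id):
        entry = (job_priority, kind, next(self._arrival), task_id)
        heapq.heappush(self._heap, entry)

    def pop(self):
        """Return the id of the next task to run."""
        return heapq.heappop(self._heap)[-1]

    def __len__(self):
        return len(self._heap)
```

Because a job's reducers are queued after its mappers, giving them the higher
priority lets a job that is nearly done finish before a newer job's mappers
take the free slots.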
Comparison of Job Grid to Service Grid
It is also worth comparing the job grid to a service grid. In the end it would
be desirable to have a single grid that is used for both jobs and service.
(Hence the scheduling function should be moved out of MR into a lower layer.)
There are many similarities and some differences between services and jobs and
their corresponding grids.
So what is a service and a service grid? A service is like a "persistent job";
a service runs "forever" (more accurately it runs till it is removed).
Acme.com's web site is an example of a service (actually it is probably 3
services representing the typical 3-tier structure). The service is persistent.
Most of the interesting services are horizontally scaled - a service has service
instances (e.g. the instances of the apache servers of acme.com's web service);
the number of service instances shrinks and grows as the load on the service
shrinks and grows. Hence the service instances are like the tasks. Indeed the
task scheduler could easily schedule the service instances.
A service has a Service Level Objective (SLO) manager that monitors the load on
a service and the capacity it needs to support that load; when the capacity is
not sufficient, the SLO manager requests additional resources. Perhaps this is
like the job manager (one usually keeps the service manager and its SLO manager
separate for various reasons).
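The core decision of such an SLO manager can be sketched as below, under
stated assumptions: load and per-instance capacity are measured in the same
unit (say requests per second), and the function name and the min/max bounds
are hypothetical, standing in for whatever an SLO specification would carry.

```python
import math


def instances_needed(load, capacity_per_instance, min_instances, max_instances):
    """Hypothetical SLO check: how many service instances the observed
    load calls for, clamped to the bounds from the service's SLO."""
    wanted = math.ceil(load / capacity_per_instance)
    return max(min_instances, min(max_instances, wanted))


def instances_to_request(current, load, capacity_per_instance,
                         min_instances, max_instances):
    """Positive result: ask the grid for more instances.
    Negative result: that many instances can be released."""
    target = instances_needed(load, capacity_per_instance,
                              min_instances, max_instances)
    return target - current
```

The requested instances would then be handed to the task scheduler for
placement, just like the tasks of a job.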
Is there a concept of a service scheduler (analogous to job scheduler) in a
service grid? Note that unlike a job grid, the set of services in a
service grid does not change as rapidly as the set of jobs in a job grid. The
services tend to be fairly persistent while their sets of service instances
(i.e. their tasks) are very dynamic. There is a capacity planning function in a
service grid which is a fairly manual function in the current generation of
service grids. The capacity planning function is supposed to ascertain that the
SLOs of the services can be satisfied. Since one does not want to configure the
service grid with the max resources for each service, one has to deal with
occasional resource shortage. The good news is that often the service mix is
such that the peak loads on the services do not all occur at the same time.
While that helps, one still has to deal with resource shortage; many systems
have service priorities and move resources from low priority services to higher
priority services (while still leaving the low priority services with a min set
of resources as defined in their SLO specifications). Automated service
capacity planning is still in an early research stage.
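The priority-based reallocation described above can be sketched as follows.
This is an illustration only: the dictionary fields, the "higher number means
higher priority" convention and the function name are assumptions, not taken
from any existing system.

```python
def plan_reclaim(services, requester_priority, needed):
    """Hypothetical reallocation plan: take resources from services with a
    lower priority than the requester, lowest priority first, never taking
    a service below the minimum set in its SLO.

    services: list of dicts with keys name, priority, allocated, min.
    Returns {service name: units to take}; may cover less than `needed`.
    """
    plan = {}
    for svc in sorted(services, key=lambda s: s["priority"]):
        if needed <= 0:
            break
        if svc["priority"] >= requester_priority:
            continue  # only lower priority services give up resources
        spare = svc["allocated"] - svc["min"]
        take = min(spare, needed)
        if take > 0:
            plan[svc["name"]] = take
            needed -= take
    return plan
```

If the plan covers less than what was needed, the shortage remains and has to
be handled some other way, which is where capacity planning comes back in.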
Getting back to jobs.
To recap:
• The proposal of getting rid of the notion of a job cluster and
dynamically scheduling the tasks of a job to the available resources on the
compute nodes as described in this JIRA makes a lot of sense.
• The notion of task-level scheduling makes a lot of sense. Indeed I
believe we can use the same task scheduler to schedule service instances (i.e.
the tasks of a service). We should call it a task scheduler - this Jira
repeatedly points out that the "job scheduler" is really the "task scheduler".
Just call it that.
• We may want the task scheduler to take action on related tasks of a job
(e.g. gang scheduling) and hence the job id should be a visible attribute of a
task.
• Perhaps we need to introduce a notion of a Job-SLO - this is something
that defines the resource requirements of a job. Furthermore map-reduce jobs may
have to be modeled as 2 sub-jobs with different Job-SLOs and priorities.
• Need a job scheduler in addition to the task scheduler. The job
scheduler will admit jobs into the grid's run queue when Job-SLOs can be
satisfactorily met. It is possible that we may get away with a very simplistic
job scheduler for map-reduce jobs. The abstraction of a job scheduler appears
to be useful.
• May need to add support for multi-phase jobs: each is a series of
map-reduce jobs where the output of one is fed into another. Such multi-phase
jobs may need to be scheduled as a collection.
• The task tracker concept can probably be made more generic. An
important function of each task tracker is to monitor the load of its
node. It is desirable for tasks to be scheduled not merely on the
number of free task slots but on the actual load on the node.
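The last point, placing tasks by actual node load rather than by slot count
alone, might look like the sketch below. The node fields, the normalized load
value in [0, 1] and the max_load ceiling are assumptions for illustration; a
real task tracker would report whatever load metric is chosen.

```python
def pick_node(nodes, max_load=0.8):
    """Hypothetical placement rule: among nodes with a free task slot,
    choose the least loaded one, skipping nodes at or above max_load.

    nodes: list of dicts with keys name, free_slots, load (0.0 to 1.0).
    Returns the chosen node name, or None if no node qualifies.
    """
    candidates = [n for n in nodes
                  if n["free_slots"] > 0 and n["load"] < max_load]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n["load"])["name"]
```

A pure slot-count rule would pick the node with the most free slots even if
it is already saturated; here such a node is skipped.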
There are similarities between services and job grids and it is worth exploring
the similarities to get the right abstractions. As for jobs and services
living in the same grid, further investigation is needed.
> generalize the TT / JT servers to handle more generic tasks
> -----------------------------------------------------------
>
> Key: HADOOP-2491
> URL: https://issues.apache.org/jira/browse/HADOOP-2491
> Project: Hadoop
> Issue Type: Improvement
> Components: mapred
> Reporter: eric baldeschwieler
>
> We've been discussing a proposal to generalize the TT / JT servers to handle
> more generic tasks and move job specific work out of the job tracker and into
> client code so the whole system is both much more general and has more
> coherent layering. The result would look more like condor/pbs like systems
> (or presumably borg) with map-reduce as a user job.
> Such a system would allow the current map-reduce code to coexist with other
> work-queuing libraries or maybe even persistent services on the same Hadoop
> cluster, although that would be a stretch goal. We'll kick off a thread with
> some documents soon.
> Our primary goal in going this way would be to get better utilization out of
> map-reduce clusters and support a richer scheduling model. The ability to
> support alternative job frameworks would just be gravy!
> ----
> Putting this in as a place holder. Hope to get folks talking about this to
> post some more detail.