[
https://issues.apache.org/jira/browse/HADOOP-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12612519#action_12612519
]
Vivek Ratan commented on HADOOP-3412:
-------------------------------------
Nice job, Tom. It's a lot cleaner and will make it easier for us to merge stuff
from HADOOP-3445. I did have a couple of comments:
A Scheduler is usually just an algorithm for deciding which task to pick for a
TT. It uses information from the TT (its hostname, rack, what it's currently
running, how many resources it has free), and information from what is required
(which job/queue to look at, capacities, user limits) and makes the best match.
I'm wondering if _TaskScheduler_ should handle addition/removal of jobs. As we
make the JT persistent, we need the ability to persist job state to disk, to
initialize jobs (expand their task-based structures, as in
JobInProgress::Init()) dynamically (since, in order to scale, you don't want
jobs to be expanded unless absolutely needed), to store only relevant
information in memory and the rest on disk. Something else should likely do
this, not _TaskScheduler_. _TaskScheduler_ needs to access the collection of
jobs when it runs its scheduling algorithms, but it should not be responsible
for them. Methods like _addJob()_ and _removeJob()_ probably belong to some
other class, something like a _JobQueueManager_. Which, by the way, can also
handle multiple queues of jobs, as we'll need for 3445. Maybe the JT itself can
handle the queues of jobs initially. Regardless, do you think _TaskScheduler_
should be responsible for jobs?
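To make the split concrete, here is a minimal sketch in Java. It is only an illustration, not code from the patch: _JobQueueManager_, _FifoTaskScheduler_, the simplified _Job_ class, and every method signature below are hypothetical, and the real _JobInProgress_ and _TaskScheduler_ types are much richer. The point is just that the scheduler reads the job collection but never adds or removes jobs.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for JobInProgress; only what the sketch needs.
final class Job {
    final String id;
    final String queue;
    private int pendingTasks;

    Job(String id, String queue, int pendingTasks) {
        this.id = id;
        this.queue = queue;
        this.pendingTasks = pendingTasks;
    }

    boolean hasPendingTask() {
        return pendingTasks > 0;
    }

    // Hands out one pending task named "<jobId>_t<n>", or null if none remain.
    String obtainTask() {
        if (pendingTasks == 0) {
            return null;
        }
        pendingTasks--;
        return id + "_t" + pendingTasks;
    }
}

// Owns the job lifecycle. Persistence and lazy initialization would also
// live here (assumption); they are left out of this sketch.
final class JobQueueManager {
    private final Map<String, List<Job>> queues = new LinkedHashMap<>();

    void addJob(Job job) {
        queues.computeIfAbsent(job.queue, q -> new ArrayList<>()).add(job);
    }

    boolean removeJob(Job job) {
        List<Job> jobs = queues.get(job.queue);
        return jobs != null && jobs.remove(job);
    }

    // Read-only views: the scheduler can look at jobs but not manage them.
    Collection<String> getQueueNames() {
        return Collections.unmodifiableSet(queues.keySet());
    }

    Collection<Job> getJobs(String queue) {
        List<Job> jobs = queues.getOrDefault(queue, Collections.emptyList());
        return Collections.unmodifiableList(jobs);
    }
}

// Pure matching logic: given a TT and the current jobs, pick a task.
interface TaskScheduler {
    String assignTask(String trackerHost, JobQueueManager jobs);
}

// Simplest possible policy: first job, in insertion order, with work left.
final class FifoTaskScheduler implements TaskScheduler {
    @Override
    public String assignTask(String trackerHost, JobQueueManager jobs) {
        for (String queue : jobs.getQueueNames()) {
            for (Job job : jobs.getJobs(queue)) {
                if (job.hasPendingTask()) {
                    return job.obtainTask();
                }
            }
        }
        return null; // nothing to run on this TT
    }
}
```

With this shape, the JT (or whatever ends up owning the queues) calls addJob() and removeJob(), and a different scheduling policy can be swapped in without touching job management.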
Another thought I had is regarding the work we're doing for 3445. It's more of
an observation than a suggestion. HADOOP-3421 introduces multiple queues,
capacity per queue, and user limits. Each of these features affects the
scheduling of tasks, which likely would go something like this: a TT indicates
that it has one or more free Map or Reduce slots. JT figures out whether to
look for a Map or Reduce task. JT needs to find a task in a job in a queue, so
it first looks at which queue to consider (primarily based on queue capacities
and whether the queue can accept a TT slot). Within the selected queue, the JT
considers which job to look at, based on user limits (the job needs to
belong to a user who is not using more capacity than he/she is allowed) and
priorities. Finally, within a job, the JT (actually, the JobInProgress object)
needs to pick what task to run, based on data locality, speculation, and some
other heuristics. My guess is that many developers will want to plug in tweaks
to some pieces of the entire scheduling algorithm, and not modify other
logic. Someone, for example, may want to tweak how a queue is chosen, and not
touch the other stuff. For this, IMO, we need to break down the JT's scheduling
flow (decide on M or R task, then pick a queue, then pick a job, then pick a
task), which now sits in _JobQueueTaskScheduler_, into discrete units and allow
folks to override one or more of these units. There are a few ways to do this
and we can use some of the suggestions and principles applied in this patch
there as well. I guess I'm really making a plug here for folks to look at
extending the stuff for 3445 (once it is ready) in a similar way, so that the
entire scheduling flow can be made extensible at different steps.
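One possible way to break the flow into overridable units is sketched below in Java. All names here (_StagedScheduler_, the four chooser interfaces, _DefaultStages_, and the simplified _TrackerStatus_, _QueueInfo_, and _JobInfo_ classes) are hypothetical and do not come from this patch or from 3445. The default policies are deliberately naive placeholders, and the sketch assumes that a lower priority value means a more urgent job.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

enum TaskType { MAP, REDUCE }

// Hypothetical, simplified views of a TaskTracker, a job, and a queue.
final class TrackerStatus {
    final String host; final int freeMapSlots; final int freeReduceSlots;
    TrackerStatus(String host, int freeMapSlots, int freeReduceSlots) {
        this.host = host; this.freeMapSlots = freeMapSlots; this.freeReduceSlots = freeReduceSlots;
    }
}

final class JobInfo {
    final String id; final int priority; final boolean userOverLimit;
    final int pendingMaps; final int pendingReduces;
    JobInfo(String id, int priority, boolean userOverLimit, int pendingMaps, int pendingReduces) {
        this.id = id; this.priority = priority; this.userOverLimit = userOverLimit;
        this.pendingMaps = pendingMaps; this.pendingReduces = pendingReduces;
    }
    int pending(TaskType type) { return type == TaskType.MAP ? pendingMaps : pendingReduces; }
}

final class QueueInfo {
    final String name; final int capacity; final int slotsInUse; final List<JobInfo> jobs;
    QueueInfo(String name, int capacity, int slotsInUse, List<JobInfo> jobs) {
        this.name = name; this.capacity = capacity; this.slotsInUse = slotsInUse; this.jobs = jobs;
    }
}

// One small interface per step of the flow, so each can be overridden alone.
interface TaskTypeChooser { Optional<TaskType> choose(TrackerStatus tt); }
interface QueueChooser { Optional<QueueInfo> choose(List<QueueInfo> queues, TaskType type); }
interface JobChooser { Optional<JobInfo> choose(QueueInfo queue, TaskType type); }
interface TaskChooser { Optional<String> choose(JobInfo job, TaskType type, TrackerStatus tt); }

// Naive placeholder policies, for illustration only.
final class DefaultStages {
    // Prefer a Map task if a Map slot is free, else a Reduce task.
    static final TaskTypeChooser TYPE = tt -> {
        if (tt.freeMapSlots > 0) return Optional.of(TaskType.MAP);
        if (tt.freeReduceSlots > 0) return Optional.of(TaskType.REDUCE);
        return Optional.empty();
    };
    // First queue that is still under its capacity.
    static final QueueChooser QUEUE = (queues, type) ->
        queues.stream().filter(q -> q.slotsInUse < q.capacity).findFirst();
    // Most urgent job whose user is within limits and that has work of this type.
    static final JobChooser JOB = (queue, type) ->
        queue.jobs.stream()
            .filter(j -> !j.userOverLimit && j.pending(type) > 0)
            .min(Comparator.comparingInt((JobInfo j) -> j.priority));
    // Ignores locality and speculation; just names a task of the right type.
    static final TaskChooser TASK = (job, type, tt) ->
        Optional.of(job.id + (type == TaskType.MAP ? "_m" : "_r"));
}

final class StagedScheduler {
    private final TaskTypeChooser typeChooser;
    private final QueueChooser queueChooser;
    private final JobChooser jobChooser;
    private final TaskChooser taskChooser;

    StagedScheduler(TaskTypeChooser t, QueueChooser q, JobChooser j, TaskChooser k) {
        this.typeChooser = t; this.queueChooser = q; this.jobChooser = j; this.taskChooser = k;
    }

    static StagedScheduler withDefaults() {
        return new StagedScheduler(DefaultStages.TYPE, DefaultStages.QUEUE,
                                   DefaultStages.JOB, DefaultStages.TASK);
    }

    // Runs the four steps in order; gives up as soon as one step finds nothing.
    Optional<String> assignTask(TrackerStatus tt, List<QueueInfo> queues) {
        Optional<TaskType> type = typeChooser.choose(tt);
        if (!type.isPresent()) return Optional.empty();
        Optional<QueueInfo> queue = queueChooser.choose(queues, type.get());
        if (!queue.isPresent()) return Optional.empty();
        Optional<JobInfo> job = jobChooser.choose(queue.get(), type.get());
        if (!job.isPresent()) return Optional.empty();
        return taskChooser.choose(job.get(), type.get(), tt);
    }
}
```

Someone who only wants to change how a queue is chosen would pass their own _QueueChooser_ to the constructor and reuse the other three defaults. Note that this sketch does not retry with another queue when the chosen queue has no eligible job; a real implementation would probably need to.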
Once again, I don't think merging the 3445 work with this patch should be very
hard, especially with the latest patch. I'll take a look at that soon. Thanks,
Brice, for your offer to help. I'm sure you can help us out there. And of
course, I'm glad you took this whole effort up and pushed it all the way to
where it is. Nice work.
> Refactor the scheduler out of the JobTracker
> --------------------------------------------
>
> Key: HADOOP-3412
> URL: https://issues.apache.org/jira/browse/HADOOP-3412
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Brice Arnould
> Assignee: Brice Arnould
> Priority: Minor
> Fix For: 0.19.0
>
> Attachments: JobScheduler-v9.patch, JobScheduler.patch,
> JobScheduler_v2.patch, JobScheduler_v3.patch, JobScheduler_v3b.patch,
> JobScheduler_v4.patch, JobScheduler_v5.patch, JobScheduler_v6.1.patch,
> JobScheduler_v6.2.patch, JobScheduler_v6.3.patch, JobScheduler_v6.4.patch,
> JobScheduler_v6.patch, JobScheduler_v7.1.patch, JobScheduler_v7.patch,
> JobScheduler_v8.patch, RackAwareJobScheduler.java,
> SimpleResourceAwareJobScheduler.java
>
>
> First I would like to warn you that my proposition is assumed to be very naive.
> I just hope that reading it won't make you lose time.
> h4. The aim
> It seems to me that improving Hadoop scheduling could be very profitable.
> But, it is hard to implement and compare schedulers, because the scheduling
> logic is mixed within the rest of the JobTracker.
> This bug is the first step of an attempt to improve the Hadoop scheduler. It
> re-implements the current scheduling algorithm in a separate class called
> JobScheduler. This new class is instantiated in the JobTracker.
> h4. Bugs fixed as a side effect
> This patch probably cannot be submitted as it is.
> A first difficulty is that it does not have exactly the same behaviour as
> the current JobTracker. More precisely, it doesn't re-implement things like
> code that seems to be never called or concurrency problems.
> I wrote TOCONFIRM where my proposition differs from the current
> implementation, so you can find them easily.
> I know that fixing bugs silently is bad. So, independently of what you decide
> about this patch, I will open issues for bugs that you confirm.
> h4. Other side effects
> Another side effect of this patch is to add documentation about each step of
> the scheduling. I hope that it will help future improvement by lowering the
> level required to contribute to the scheduler.
> It also reduces the complexity and the granularity of the JobTracker (making
> it more parallel).
> h4. The future
> If you feel that this is a step in the right direction, I will try to propose a
> JobSchedulerInterface that many JobSchedulers could implement and to propose
> alternatives to the current « FifoJobScheduler ». If some of you have ideas
> about that please tell ^^ I will also open issues for things marked as FIXME
> in the patch.