Can we suspend jobs (just a Unix suspend) instead of killing them?

If we can, perhaps we don't even have to bother delaying the use of additional
slots beyond the limit.
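
To make the suspend idea concrete, here is a minimal sketch, assuming each
task runs as its own Unix process whose pid the scheduler knows. The names
suspend_task and resume_task are hypothetical and not part of Hadoop; the
only real calls are os.kill with the standard SIGSTOP and SIGCONT signals.

```python
import os
import signal


def suspend_task(pid):
    """Pause a task process; SIGSTOP cannot be caught or ignored."""
    os.kill(pid, signal.SIGSTOP)


def resume_task(pid):
    """Let a previously suspended task process continue."""
    os.kill(pid, signal.SIGCONT)
```

A suspended process gives up the CPU but keeps its memory and open files, so
this frees a slot only in the sense of CPU time.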

________________________________

From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thu 1/10/2008 11:21 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Question on running simultaneous jobs



Runping Qi wrote:
> An improvement over Doug's proposal is to make the limit soft in the
> following sense:
>
> 1. A job is entitled to run up to the limit number of tasks.
> 2. If there are free slots and no other job waits for their entitled
> slots, a job can run more tasks than the limit.
> 3. When a job runs more tasks than its limit, and a new job comes, we
> may do one of the two:
>       a) kill some of the tasks to make room for the new job.
>       b) let all the running tasks run to completion. Any freed-up slot will
> be assigned to the new job.

I think this would be a good second phase, as it will be trickier to
implement.
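
The soft-limit rules quoted above could be sketched roughly as follows. This
is only an illustration under my own assumptions: may_launch_task and
tasks_to_reclaim are hypothetical names, and the scheduler is assumed to know
each job's running task count and whether other jobs are still waiting for
their entitled slots.

```python
def may_launch_task(running, limit, free_slots, other_jobs_waiting):
    """Decide whether a job may start one more task under the soft limit."""
    if free_slots <= 0:
        return False
    if running < limit:
        return True  # rule 1: entitled to run up to the limit
    # rule 2: may exceed the limit only if nobody else is waiting
    return not other_jobs_waiting


def tasks_to_reclaim(running, limit, slots_needed, kill):
    """Rule 3: how many over-limit tasks to kill when a new job arrives.

    kill=True is option (a); kill=False is option (b), where tasks run to
    completion and slots are handed over only as they free up.
    """
    if not kill:
        return 0
    over_limit = max(0, running - limit)
    return min(over_limit, slots_needed)
```

For example, a job running 8 tasks against a limit of 5 would lose at most 3
tasks under option (a), however many slots the new job asks for.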

Jobs that disable speculative execution may not like having tasks killed
(although they must in general still be tolerant of it) so we might only
permit jobs with speculative execution enabled to exceed their limit.

Also there should be a delay before a job is permitted to run over its
limit, in order to give other jobs an opportunity to launch.  For
example, if a user is submitting a series of jobs, each consuming the
output of the previous, then we wouldn't want an already running job to
immediately consume all the free slots when one job completes, since
another job will soon be started that is more deserving of these slots.
Perhaps, when portions of the cluster are idle, jobs should gradually
be permitted to exceed their limit.  Then, if new jobs are launched,
tasks should only gradually be killed, first giving them the opportunity
to finish normally.  Some tuning will probably be required to get this
right.
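
As a rough sketch of the delay plus gradual ramp-up described above: the
function name, the 60 second delay and the rate of 2 slots per minute are
all made-up values for illustration, and they are exactly the kind of thing
that would need tuning.

```python
def extra_slots(idle_seconds, delay_seconds=60.0, slots_per_minute=2.0):
    """Number of slots a job may use beyond its limit.

    idle_seconds is how long part of the cluster has been idle. Nothing
    extra is granted during the initial delay, which gives a follow-up job
    time to launch; after that the allowance grows linearly.
    """
    if idle_seconds <= delay_seconds:
        return 0
    minutes_past_delay = (idle_seconds - delay_seconds) / 60.0
    return int(minutes_past_delay * slots_per_minute)
```

With these assumed defaults a job gets nothing extra in the first minute, 2
extra slots after two minutes, and 10 after six minutes. The same shape could
be applied in reverse when killing tasks for newly launched jobs.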

Ideally the limit would be dynamic, perhaps something like max(10,
#slots/#jobs), so jobs would only be queued when there are fewer than 10
slots/job.  But a static limit would still be a significant improvement
and easier to implement in the first version.
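
A small sketch of the dynamic limit, assuming integer division for
#slots/#jobs and a hypothetical helper name; the zero-job case is my own
choice, not something specified above.

```python
def dynamic_limit(total_slots, num_jobs, floor=10):
    """Per-job task limit: max(floor, total_slots / num_jobs)."""
    if num_jobs <= 0:
        return total_slots  # assumption: with no jobs there is nothing to limit
    return max(floor, total_slots // num_jobs)


def jobs_running_at_once(total_slots, num_jobs, floor=10):
    """How many jobs can hold a full limit's worth of slots simultaneously."""
    limit = dynamic_limit(total_slots, num_jobs, floor)
    return min(num_jobs, total_slots // limit)
```

With 100 slots and 4 jobs, each job gets a limit of 25 and all 4 run. With
100 slots and 20 jobs there are only 5 slots/job, so the limit stays at 10,
10 jobs run and the other 10 are queued.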

Doug

