[
https://issues.apache.org/jira/browse/MAPREDUCE-1204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971080#action_12971080
]
Joydeep Sen Sarma commented on MAPREDUCE-1204:
----------------------------------------------
i am not sure how helpful it will be to solve this jira (or whether this can be
solved at all). one problem is that there is a *lot* of lag between asking for
something to be preempted and getting to schedule something new on the same
slot:
- the kill action is only dispatched on the next heartbeat
- when the slot is freed - typically the TASK_CLEANUP is scheduled !!! (we are
trying to make task cleanup optional and skip it from hive)
- on the third heartbeat from the node - the slot is now free to be assigned to
a real task.
however - a lot of time has typically elapsed by now. the bigger the cluster -
the worse this problem is (both because of the heartbeat period, if you are not
using OOB heartbeats, and because the JT quickly gets bottlenecked processing
heartbeats and they queue up). chances are that by this time some other job
will have become higher priority for scheduling.
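a rough way to put numbers on this lag is sketched below. this is only
back-of-the-envelope arithmetic: the class and method names are made up for
illustration, the interval and queueing-delay values are arbitrary examples,
and the assumption that each of the three heartbeats costs one full interval
plus JT queueing delay is a simplification, not a measurement.

```java
final class PreemptionLag {
    // Heartbeats counted above between requesting a kill and a real task
    // being assignable: kill dispatched, TASK_CLEANUP scheduled, slot free.
    static final int HEARTBEATS_TO_REUSE = 3;

    /**
     * Rough upper estimate of the preemption lag in milliseconds.
     * Assumption: every heartbeat waits one full interval and then sits in
     * the JT queue for queueDelayMs before being processed.
     */
    static long estimateLagMillis(long heartbeatIntervalMs, long queueDelayMs) {
        return HEARTBEATS_TO_REUSE * (heartbeatIntervalMs + queueDelayMs);
    }

    public static void main(String[] args) {
        // Illustrative values only: a short interval with an idle JT ...
        System.out.println(estimateLagMillis(3_000, 0));      // prints 9000
        // ... versus a longer interval with heartbeats queueing at the JT.
        System.out.println(estimateLagMillis(10_000, 2_000)); // prints 36000
    }
}
```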
of course - we don't preempt for the highest priority job (I mean in terms of
ordering by fair-scheduler) to begin with (MAPREDUCE-2205). So preempted
slots go to other jobs anyway (if they are ahead in scheduling priority), and
looking only at the requirements of the job requesting preemption is not
necessary. (the slot needs to be usable by any of the jobs whose priority is >=
that of the job that requested preemption - which is a less stringent
requirement)
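the less stringent requirement could be checked roughly as follows. this is
a sketch, not fair-scheduler code: Job here is a hypothetical stand-in (not
JobInProgress), "usable" is reduced to "tracker not blacklisted by the job",
and "any of the jobs" is read as "at least one job at or ahead of the
requester in scheduling order".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RelaxedPreemptionCheck {
    /** Hypothetical stand-in for a job; not Hadoop's JobInProgress. */
    static final class Job {
        final String name;
        final Set<String> blacklistedTrackers;

        Job(String name, String... blacklisted) {
            this.name = name;
            this.blacklistedTrackers = new HashSet<>(Arrays.asList(blacklisted));
        }

        // Assumption: a slot is usable unless its tracker is blacklisted.
        boolean canUse(String tracker) {
            return !blacklistedTrackers.contains(tracker);
        }
    }

    /**
     * Returns true if at least one job at or ahead of the requester in
     * scheduling order could use a slot on the given tracker. Jobs behind
     * the requester are ignored.
     */
    static boolean slotIsWorthPreempting(List<Job> schedulingOrder,
                                         Job requester, String tracker) {
        for (Job job : schedulingOrder) {
            if (job.canUse(tracker)) {
                return true;
            }
            if (job == requester) {
                break; // everything after this is lower priority
            }
        }
        return false;
    }
}
```

with this check, a slot on a tracker that the requesting job has blacklisted
is still a valid preemption target as long as some job ahead of it can run
there.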
> Fair Scheduler preemption may preempt tasks running in slots unusable by the
> preempting job
> -------------------------------------------------------------------------------------------
>
> Key: MAPREDUCE-1204
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1204
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Components: contrib/fair-share
> Affects Versions: 0.21.0
> Reporter: Todd Lipcon
>
> The current preemption code works by first calculating how many tasks need to
> be preempted to satisfy the min share constraints, and then killing an equal
> number of tasks from other jobs, sorted to favor killing of young tasks. This
> works fine for the general case, but there are some edge cases where this can
> cause problems.
> For example, if the preempting job has blacklisted ("marked flaky") a
> particular task tracker, and that tracker is running the youngest task,
> preemption can still kill that task. The preempting job will then refuse that
> slot, since the tracker has been blacklisted. The same task that just got
> killed then gets rescheduled in that slot. This repeats ad infinitum until a
> new slot opens in the cluster.
> I don't have a good test case for this, yet, but logically it is possible.
> One potential fix would be to add an API to JobInProgress that functions
> identically to obtainNewMapTask but does not schedule the task. The
> preemption code could then use this while iterating through the sorted
> preemption list to check that the preempting jobs can actually make use of
> the candidate slots before killing them.
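The fix proposed in the description above could be sketched like this. It is
an illustration under stated assumptions, not the Hadoop implementation:
RunningTask and wouldAcceptSlot are hypothetical names, and wouldAcceptSlot
only models blacklisting, standing in for the proposed variant of
obtainNewMapTask that checks a slot without scheduling a task.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PreemptionDryRun {
    /** Hypothetical view of a running task that is a preemption candidate. */
    static final class RunningTask {
        final String id;
        final String tracker;

        RunningTask(String id, String tracker) {
            this.id = id;
            this.tracker = tracker;
        }
    }

    // Stand-in for the proposed dry-run API: would the preempting job take
    // a slot on this tracker? Only blacklisting is modelled here.
    static boolean wouldAcceptSlot(Set<String> blacklistedTrackers, String tracker) {
        return !blacklistedTrackers.contains(tracker);
    }

    /**
     * Walks the candidates (assumed already sorted youngest first) and picks
     * up to tasksToPreempt victims, skipping tasks whose slot the preempting
     * job would refuse.
     */
    static List<RunningTask> selectVictims(List<RunningTask> youngestFirst,
                                           Set<String> blacklistedTrackers,
                                           int tasksToPreempt) {
        List<RunningTask> victims = new ArrayList<>();
        for (RunningTask task : youngestFirst) {
            if (victims.size() >= tasksToPreempt) {
                break;
            }
            if (wouldAcceptSlot(blacklistedTrackers, task.tracker)) {
                victims.add(task);
            }
        }
        return victims;
    }
}
```

With the youngest task on a blacklisted tracker, the loop skips it and kills
the next-youngest task instead, which avoids the kill-and-reschedule cycle
described in the report.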