[ https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654333#comment-15654333 ]
Jason Lowe commented on TEZ-3491:
---------------------------------
Filed TEZ-3535 to track improving what the scheduler does when it receives
containers that are lower priority than the top priority task requests.
bq. What scenario led to this situation? Did YARN end up handing out a lower
priority ask before a higher priority ask, or did the AM decide to release
containers that it had obtained at a higher priority, or were there multiple
DAGs in the same AM?
It was a resource request race. This occurred in a DAG that had a lot of root
vertices. Vertex 0 had a relatively slow input initializer, so all the other
root vertices initialized a few hundred milliseconds before vertex 0. That
allowed the YarnTaskScheduler to receive all the task requests for the lower
priority root vertices and request them from the RM. By the time those
hundreds of containers arrived, vertex 0 had finished initializing and made a
request for thousands of top-priority tasks. The YarnTaskScheduler therefore
refused to assign the lower priority containers to the top-priority tasks and
held onto them. It took more than 10 minutes for the thousands of top-priority
tasks from vertex 0 to be scheduled (the hundreds of unused, low-priority
container allocations weren't helping matters), and that delay caused the RM to
expire the lower priority allocations. When vertex 0 finally finished, the
containers weren't reusable for the other vertices (due to locality reuse
constraints), so the job hung. The expired, lower priority containers were
never re-requested, so the Tez AM thought it was still going to receive
allocations that the RM had already granted and expired.
Note that this could occur in other DAGs between parallel paths in the graph.
Just as one path starts requesting lower priority tasks the other path also
requests higher priority tasks. If the scheduler heartbeats in between those
request batches, the RM can grant containers that are lower priority than the
highest priority task request. The scheduler holds onto those, and we get poor
performance.
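The hold decision described above can be boiled down to a priority comparison. The following is a simplified, hypothetical sketch of that decision, not Tez's actual YarnTaskScheduler code; note that in YARN a *lower* numeric priority value denotes a *higher* scheduling priority.

```java
// Hypothetical sketch of the hold-vs-assign decision for a newly allocated
// container; the method name and shape are illustrative, not Tez APIs.
public class PriorityInversionSketch {

    /**
     * Returns true if a newly allocated container would be held rather than
     * assigned: its priority is lower (numerically greater) than the
     * highest-priority pending task request.
     */
    static boolean heldByScheduler(int containerPriority, int topPendingPriority) {
        return containerPriority > topPendingPriority;
    }

    public static void main(String[] args) {
        // Vertex 0's late-arriving asks are at priority 2 (top); the other
        // root vertices asked earlier at priority 5. The priority-5
        // containers that then arrive get held indefinitely.
        System.out.println(heldByScheduler(5, 2)); // held -> true
        System.out.println(heldByScheduler(2, 2)); // matches top -> assigned, false
    }
}
```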
bq. You had mentioned an alternate approach / potential improvement to solve
the same problem in an offline discussion.
I was thinking we should treat newly allocated containers like the non-reuse
case when they arrive, i.e., we look up task requests at the container's
priority and assign them even if they aren't the highest priority. YARN
shouldn't be giving us those containers unless it has already allocated
containers for all other higher-priority requests or there's a request race
(like above). I argue that we can just assign them to their natural priority
requests directly even if there's a request race since we have to solve that
problem anyway. We already can get into situations where lower priority tasks
are requested, allocated, and assigned before higher priority requests arrive,
so this would be no different. Certainly holding onto the lower priority
containers for an indefinite time period is not the right thing to do. If we
can't assign them then we should release them, but assigning them would be
preferable. If the new containers can't be assigned because there are no
outstanding requests at that priority then we can fall back to the normal reuse
logic to try to use them. I have a prototype patch that I can post to TEZ-3535.
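The proposed policy above can be sketched as a three-step fallback: assign the new container to a request at its own priority, else try the normal reuse path, else release it rather than hold it indefinitely. This is a hypothetical illustration, not the TEZ-3535 patch; the class, the `pending` map, and `tryReuse()` are invented names standing in for Tez's real request bookkeeping and reuse matching.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the proposed new-container policy; names here are
// illustrative, not Tez APIs.
public class NewContainerPolicySketch {

    // Pending task requests keyed by their YARN priority value.
    final Map<Integer, Queue<String>> pending = new HashMap<>();

    enum Outcome { ASSIGNED_AT_PRIORITY, REUSED, RELEASED }

    Outcome onNewContainer(int containerPriority) {
        Queue<String> reqs = pending.get(containerPriority);
        if (reqs != null && !reqs.isEmpty()) {
            reqs.poll();                    // assign to its natural-priority request,
            return Outcome.ASSIGNED_AT_PRIORITY; // even if higher-priority asks exist
        }
        if (tryReuse(containerPriority)) {  // fall back to the normal reuse logic
            return Outcome.REUSED;
        }
        return Outcome.RELEASED;            // never hold a container indefinitely
    }

    boolean tryReuse(int priority) {
        // Placeholder for the existing reuse matching (locality constraints,
        // etc.); always fails in this sketch.
        return false;
    }

    public static void main(String[] args) {
        NewContainerPolicySketch s = new NewContainerPolicySketch();
        s.pending.computeIfAbsent(5, k -> new ArrayDeque<>()).add("task-0");
        System.out.println(s.onNewContainer(5)); // ASSIGNED_AT_PRIORITY
        System.out.println(s.onNewContainer(5)); // no priority-5 requests left
    }
}
```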
Thanks for the review! I'll commit this later today.
> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
> Key: TEZ-3491
> URL: https://issues.apache.org/jira/browse/TEZ-3491
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives containers at a lower priority than the highest
> priority task being requested then it fails to assign the container to any
> task. In addition, if the container is new then it refuses to release it if
> there are any pending tasks. If it takes too long for the higher priority
> requests to be fulfilled (e.g.: the lower priority containers are filling the
> queue) then eventually YARN will expire the unused lower priority containers
> since they were never launched. The Tez AM then never re-requests these
> lower priority containers and the job hangs because the AM is waiting for
> containers from the RM that the RM already sent and expired.