[ 
https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654333#comment-15654333
 ] 

Jason Lowe commented on TEZ-3491:
---------------------------------

Filed TEZ-3535 to track improving what the scheduler does when it receives 
containers that are lower priority than the top priority task requests.

bq. What scenario led to this situation? Did YARN end up handing out a lower 
priority ask before a higher priority ask, or did the AM decide to release 
containers that it had obtained at a higher priority, or are there multiple 
DAGs in the same AM?

It was a resource request race.  This occurred in a DAG that had a lot of root 
vertices.  Vertex 0 had a relatively slow input initializer, so all the other 
root vertices initialized a few hundred milliseconds before vertex 0.  That 
allowed the YarnTaskScheduler to receive all the task requests for the lower 
priority root vertices and request them from the RM.  By the time those 
hundreds of containers arrived, vertex 0 had finished initializing and made a 
request for thousands of top-priority tasks.  So the YarnTaskScheduler refused 
to assign the lower priority containers to the top-priority tasks and held onto 
them.  It took more than 10 minutes for the thousands of top-priority tasks 
from vertex 0 to be scheduled (the hundreds of low-priority, unused container 
allocations weren't helping here), and that caused the RM to expire the lower 
priority allocations.  When vertex 0 finally finished, the containers weren't 
reusable for the other vertices (due to locality reuse constraints), so the job 
hung.  The expired, lower priority containers were never re-requested, so the 
Tez AM thought it was still going to receive allocations that the RM had 
already granted and expired.

Note that this could also occur between parallel paths in other DAGs.  Just as 
one path starts requesting lower priority tasks, the other path requests 
higher priority tasks.  If the scheduler heartbeats in between those request 
batches, the RM can grant containers that are lower priority than the highest 
priority task request.  The scheduler holds onto those, and we get poor 
performance.
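To make the hold behavior concrete, here is a toy model of the current 
scheduler policy described above.  This is not actual Tez code, and the names 
(try_assign, pending_priorities) are hypothetical; it only illustrates the 
inversion.  Note that in YARN a smaller priority number means a higher 
priority.

```python
# Toy model of the current hold behavior (hypothetical names, not Tez code).
# In YARN, a smaller priority number means a higher priority.

def try_assign(held, pending_priorities, container_priority):
    """Assign a newly allocated container only if it matches the
    highest-priority (lowest-numbered) outstanding request; otherwise
    hold onto it, as the scheduler does today."""
    top = min(pending_priorities)
    if container_priority == top:
        return "assigned"
    held.append(container_priority)  # container sits idle until the RM expires it
    return "held"

held = []
# Vertex 0's top-priority (priority 1) asks are pending, but the containers
# arriving now were granted for the lower-priority (priority 3) root vertices.
print(try_assign(held, [1, 3], 3))  # held, due to the priority inversion
print(try_assign(held, [1, 3], 1))  # assigned
```

In the race above, hundreds of priority-3 containers take the "held" path and 
eventually expire without ever being launched.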

bq. You had mentioned an alternate approach / potential improvement to solve 
the same problem in an offline discussion.

I was thinking we should treat newly allocated containers like the non-reuse 
case when they arrive, i.e., we look up task requests at the container's 
priority and assign them even if they aren't the highest priority.  YARN 
shouldn't be giving us those containers unless it has already allocated 
containers for all other higher-priority requests or there's a request race 
(like above).  I argue that we can just assign them to their natural priority 
requests directly even if there's a request race, since we have to solve that 
problem anyway.  We can already get into situations where lower priority tasks 
are requested, allocated, and assigned before higher priority requests arrive, 
so this would be no different.  Certainly holding onto the lower priority 
containers for an indefinite time period is not the right thing to do.  If we 
can't assign them then we should release them, but assigning them would be 
preferable.  If the new containers can't be assigned because there are no 
outstanding requests at that priority, then we can fall back to the normal 
reuse logic to try to use them.  I have a prototype patch that I can post to 
TEZ-3535.
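A minimal sketch of that proposed behavior, under the assumption that pending 
requests are tracked as a priority-to-outstanding-count map (hypothetical 
bookkeeping; the real YarnTaskScheduler data structures differ):

```python
# Sketch of the proposed policy (hypothetical bookkeeping, not the patch).

def assign_new_container(pending, container_priority):
    """Match a new container against requests at its own priority first,
    even when higher-priority requests exist; if none remain, fall back
    to the normal reuse logic (and release if reuse fails) instead of
    holding the container indefinitely."""
    if pending.get(container_priority, 0) > 0:
        pending[container_priority] -= 1
        return "assigned"
    return "fallback"  # try normal reuse logic, else release

pending = {1: 1000, 3: 200}              # priority -> outstanding task requests
print(assign_new_container(pending, 3))  # assigned, despite priority-1 asks
print(assign_new_container(pending, 5))  # fallback
```

The key difference from today's behavior is that the priority-3 container is 
assigned immediately rather than held, so nothing is left to expire.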

Thanks for the review!  I'll commit this later today.

> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
>                 Key: TEZ-3491
>                 URL: https://issues.apache.org/jira/browse/TEZ-3491
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives containers at a lower priority than the highest 
> priority task being requested, then it fails to assign the container to any 
> task.  In addition, if the container is new, it refuses to release it if 
> there are any pending tasks.  If it takes too long for the higher priority 
> requests to be fulfilled (e.g.: the lower priority containers are filling the 
> queue) then eventually YARN will expire the unused lower priority containers 
> since they were never launched.  The Tez AM then never re-requests these 
> lower priority containers and the job hangs because the AM is waiting for 
> containers from the RM that the RM already sent and expired.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)