[ 
https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15655015#comment-15655015
 ] 

Jason Lowe commented on TEZ-3491:
---------------------------------

bq. I'd prefer ignoring priorities while assigning containers. Track situations 
where this happens, or check the pending request table to see if requests are 
outstanding with YARN - and make additional requests. Do you think that will 
work?

I think it could.  It is probably a more involved change, but I think it would 
execute more efficiently in practice.  If we go that route, then I would argue 
Tez should avoid using priorities in YARN as much as possible: allocate all 
containers at the same priority and assign them based on task priority.  This 
doesn't fully work in practice because of the 
different-container-sizes-at-the-same-priority limitation in YARN, which was 
only recently fixed, so Tez would still need to change the priority when the 
containers are different sizes.
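
To make the idea concrete, here is a minimal sketch of assigning an allocated 
container by task priority while ignoring the priority attached to the 
container itself.  Everything in it is hypothetical (the class 
PriorityAgnosticAssigner and its methods are not Tez or YARN APIs), container 
size is reduced to a single memory value, and it assumes that a lower priority 
number means a more urgent task.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch, not Tez code: give each allocated container to the most
// urgent pending task that fits, regardless of the container's own priority.
public class PriorityAgnosticAssigner {
    static final class PendingTask {
        final String name;
        final int priority;  // assumption: lower value = more urgent
        final int memoryMb;  // assumption: size modeled as memory only

        PendingTask(String name, int priority, int memoryMb) {
            this.name = name;
            this.priority = priority;
            this.memoryMb = memoryMb;
        }
    }

    private final List<PendingTask> pending = new ArrayList<>();

    public void addTask(String name, int priority, int memoryMb) {
        pending.add(new PendingTask(name, priority, memoryMb));
    }

    // Returns the name of the task that received the container, or empty if
    // no pending task fits in a container of this size.
    public Optional<String> assign(int containerMemoryMb) {
        Optional<PendingTask> best = pending.stream()
            .filter(t -> t.memoryMb <= containerMemoryMb)
            .min(Comparator.comparingInt((PendingTask t) -> t.priority));
        best.ifPresent(t -> pending.remove(t));
        return best.map(t -> t.name);
    }

    public int pendingCount() {
        return pending.size();
    }
}
```

In this sketch a container that is too small for the most urgent task still 
goes to the next task that fits, so it is never left idle waiting for a 
higher-priority request to be satisfied.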

We can continue the discussion on TEZ-3535.

> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
>                 Key: TEZ-3491
>                 URL: https://issues.apache.org/jira/browse/TEZ-3491
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>             Fix For: 0.9.0, 0.8.5
>
>         Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives containers at a lower priority than the highest 
> priority task being requested, then it fails to assign the container to any 
> task.  In addition, if the container is new, then it refuses to release it if 
> there are any pending tasks.  If it takes too long for the higher priority 
> requests to be fulfilled (e.g., the lower priority containers are filling the 
> queue), then eventually YARN will expire the unused lower priority containers 
> since they were never launched.  The Tez AM then never re-requests these 
> lower priority containers, and the job hangs because the AM is waiting for 
> containers from the RM that the RM already sent and expired.
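
The bookkeeping gap described above can be sketched as follows.  This is only 
an illustration of the failure mode and of one possible way to avoid it 
(re-requesting a container that expired before launch).  It is not taken from 
the attached patch, and the class ContainerRequestLedger and its methods are 
hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not Tez code: track how many containers the AM still
// expects from the RM at each priority.
public class ContainerRequestLedger {
    private final Map<Integer, Integer> outstanding = new HashMap<>();

    // The AM asks the RM for one container at this priority.
    public void request(int priority) {
        outstanding.merge(priority, 1, Integer::sum);
    }

    // The RM delivered a container, so that request is now considered satisfied.
    public void allocated(int priority) {
        outstanding.computeIfPresent(priority, (k, v) -> v > 1 ? v - 1 : null);
    }

    // The RM expired a container that was never launched.  Without this step
    // the ledger shows nothing outstanding while a task still needs a
    // container, which is the hang described in the issue.
    public void expiredBeforeLaunch(int priority) {
        request(priority);
    }

    public int outstanding(int priority) {
        return outstanding.getOrDefault(priority, 0);
    }
}
```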



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
