[
https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15654981#comment-15654981
]
Siddharth Seth commented on TEZ-3491:
-------------------------------------
Got it. Wondering if it would help to go back to the behaviour where vertices
at the same level generate the same priority - if they have similar container
requests. I think that minimizes the chances of hitting this. Like you pointed
out, it is still possible if a vertex at a lower priority ends up starting
before a vertex at a higher priority, on a parallel branch.
With the changed priority model, there's the additional effect that the
scheduler will try to finish tasks for a particular vertex, instead of randomly
allocating tasks across vertices at the same level. That was a change we wanted
to achieve - via different means though.
bq. I was thinking we should treat newly allocated containers like the
non-reuse case when they arrive, i.e.: we lookup task requests at the
container's priority and assign them even if they aren't the highest priority.
This would result in preemption if containers are not available for tasks at a
higher priority level, correct? Makes sense to me - it's better than holding on
to them and doing nothing.
I'd prefer ignoring priorities while assigning containers. Track situations
where this happens, or check the pending request table to see whether requests
are still outstanding with YARN - and make additional requests if so. Do you
think that will work?
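To illustrate the idea: a minimal sketch of priority-agnostic assignment, assuming a hypothetical ContainerAssigner class (this is not Tez's actual scheduler code; all names are made up). A newly allocated container is matched to the highest-priority pending request regardless of the priority the container was allocated at; a real implementation would then consult the pending request table and re-request from YARN for the priority level that was "stolen" from.

```java
import java.util.*;

// Hypothetical sketch, not Tez's YarnTaskSchedulerService: assign a newly
// allocated container to the highest-priority pending task request, ignoring
// the priority the container was allocated at.
public class ContainerAssigner {
    // Pending task requests keyed by priority (lower number = higher priority).
    private final TreeMap<Integer, Deque<String>> pending = new TreeMap<>();

    public void addRequest(int priority, String task) {
        pending.computeIfAbsent(priority, p -> new ArrayDeque<>()).add(task);
    }

    // Returns the task assigned to the container, or null if nothing pending.
    public String assign(int containerPriority) {
        Map.Entry<Integer, Deque<String>> e = pending.firstEntry();
        if (e == null) {
            return null;
        }
        String task = e.getValue().poll();
        if (e.getValue().isEmpty()) {
            pending.remove(e.getKey());
        }
        // If the matched request's priority differs from containerPriority,
        // the request originally made at containerPriority is still
        // outstanding with YARN; a real implementation would issue an
        // additional container request here to compensate.
        return task;
    }

    public boolean hasOutstanding(int priority) {
        return pending.containsKey(priority);
    }

    public static void main(String[] args) {
        ContainerAssigner a = new ContainerAssigner();
        a.addRequest(1, "map-task");      // higher priority
        a.addRequest(5, "reduce-task");   // lower priority
        // A container allocated at priority 5 still serves the priority-1 task.
        System.out.println(a.assign(5));  // map-task
        System.out.println(a.assign(5));  // reduce-task
    }
}
```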
> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
> Key: TEZ-3491
> URL: https://issues.apache.org/jira/browse/TEZ-3491
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Priority: Critical
> Fix For: 0.9.0, 0.8.5
>
> Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives containers at a lower priority than the highest
> priority task being requested then it fails to assign the container to any
> task. In addition if the container is new then it refuses to release it if
> there are any pending tasks. If it takes too long for the higher priority
> requests to be fulfilled (e.g.: the lower priority containers are filling the
> queue) then eventually YARN will expire the unused lower priority containers
> since they were never launched. The Tez AM then never re-requests these
> lower priority containers and the job hangs because the AM is waiting for
> containers from the RM that the RM already sent and expired.
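The hang described in the issue can be sketched as a tiny state model (hypothetical names, not Tez code): a new container arrives at a lower priority than the top pending request, the AM holds it without launching it, YARN expires it, and because the AM never re-requests, the pending request count stays positive forever.

```java
// Minimal sketch of the reported hang (hypothetical, not Tez code).
// Lower priority *value* means higher priority, as in YARN.
public class PriorityInversionHang {
    int outstandingRequests = 1;  // one high-priority task still needs a container
    int heldIdleContainers = 0;   // new containers held by the AM, never launched

    void onContainerAllocated(int containerPriority, int highestPendingPriority) {
        if (containerPriority > highestPendingPriority) {
            heldIdleContainers++;   // can't match it to the top request; hold it
        } else {
            outstandingRequests--;  // normal assignment path
        }
    }

    void onContainerExpired() {
        heldIdleContainers--;       // YARN reclaims the unlaunched container
        // Bug: the lost container is not re-requested here, so the job hangs.
    }

    boolean hung() {
        return outstandingRequests > 0 && heldIdleContainers == 0;
    }
}
```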
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)