[ https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Lowe updated TEZ-3491:
----------------------------
Attachment: TEZ-3491.001.patch
Here's a patch that addresses only the hang. The AM gets out of sync with the
RM because it ignores the completion events for containers that expire before
they are launched. With this patch, when a container completes the AM
reschedules it if it can find outstanding requests at the same priority as
the container.
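
As a rough illustration of that idea, here is a toy model in Python. It is
not the patch and not Tez code: ToyScheduler, its fields, and its method
names are all hypothetical, and the check is deliberately simplified to "is
any task still pending at the completed container's priority".

```python
from collections import Counter


class ToyScheduler:
    """Toy stand-in for an AM's bookkeeping. Hypothetical, not Tez code."""

    def __init__(self):
        self.pending_tasks = Counter()  # priority -> tasks with no container yet
        self.rm_asks = Counter()        # priority -> asks the RM has not satisfied
        self.held = {}                  # container id -> priority (allocated, unlaunched)

    def request_task(self, priority):
        # A new task needs a container, so the AM asks the RM for one.
        self.pending_tasks[priority] += 1
        self.rm_asks[priority] += 1

    def on_allocated(self, container_id, priority):
        # The RM treats the ask as satisfied once it hands out a container.
        self.rm_asks[priority] -= 1
        self.held[container_id] = priority

    def on_completed(self, container_id):
        """Handle a completion event, e.g. an unlaunched container expiring.

        Returns True if a replacement ask was sent to the RM.
        """
        priority = self.held.pop(container_id, None)
        if priority is None:
            return False  # not a container this model was holding
        if self.pending_tasks[priority] > 0:
            # Work is still waiting at this priority, so ask again
            # instead of silently dropping the completion event.
            self.rm_asks[priority] += 1
            return True
        return False
```

In this model, ignoring the completion event would leave rm_asks at zero
while pending_tasks stays positive, which is the out-of-sync state described
above.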
Note that the patch does not fix the significant scheduling performance
issue: the AM still refuses to schedule new containers that are lower
priority than the top-priority pending requests, and it does not discard them
until they expire. We may integrate this fix and defer that problem to
another JIRA so we can at least keep Tez jobs from hanging completely in this
scenario.
> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
> Key: TEZ-3491
> URL: https://issues.apache.org/jira/browse/TEZ-3491
> Project: Apache Tez
> Issue Type: Bug
> Affects Versions: 0.7.1
> Reporter: Jason Lowe
> Priority: Critical
> Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives a container at a lower priority than the highest
> priority task being requested, it fails to assign the container to any
> task. In addition, if the container is new, the AM refuses to release it
> while there are any pending tasks. If the higher-priority requests take too
> long to be fulfilled (e.g., the lower-priority containers are filling the
> queue), YARN eventually expires the unused lower-priority containers
> because they were never launched. The Tez AM then never re-requests these
> lower-priority containers, and the job hangs because the AM is waiting for
> containers that the RM already sent and expired.
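
The sequence described above can be walked through with a small toy
simulation. This is only a sketch under simplifying assumptions (one
high-priority and one low-priority task, one container each); simulate() and
its rerequest_on_expiry flag are hypothetical and do not correspond to any
Tez or YARN API.

```python
def simulate(rerequest_on_expiry):
    """Return the set of tasks left without a container at the end of a toy run."""
    pending = {"high", "low"}  # tasks waiting for a container
    rm_asks = {"high", "low"}  # asks the RM has not yet satisfied
    held = set()               # containers the AM holds but has not launched

    # 1. The RM satisfies the low-priority ask first.
    rm_asks.discard("low")
    held.add("low")

    # 2. The AM neither assigns the container (a higher-priority task is
    #    pending) nor releases it (tasks are still pending), so it sits idle.

    # 3. The idle container expires because it was never launched.
    held.discard("low")
    if rerequest_on_expiry and "low" in pending:
        rm_asks.add("low")  # ask the RM again for the lost container

    # 4. The RM finally satisfies the high-priority ask and that task runs.
    rm_asks.discard("high")
    pending.discard("high")

    # 5. The RM satisfies whatever asks remain. Nothing higher is pending
    #    any more, so these containers can be assigned.
    for priority in sorted(rm_asks):
        pending.discard(priority)
    rm_asks.clear()

    return pending
```

With rerequest_on_expiry=False the low-priority task is left pending with no
outstanding ask, which is the hang; with rerequest_on_expiry=True nothing is
left pending.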