[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955483#comment-14955483
]
Jason Lowe commented on TEZ-808:
--------------------------------
Correct, in this latest case the tasks were part of Pig streaming jobs and
somehow lost track of their subprocess. The subprocess had exited long ago,
but the task was still waiting for a next tuple that would never arrive.
Something must be broken with Pig streaming there, but Tez should not
have allowed the tasks to hang indefinitely. Each time a task hung, there were
many downstream tasks waiting for it to complete, so the overall cluster
footprint of these hung jobs was significant. In some cases it was enough to
clog up the queue, preventing not only this job from completing but all others
behind it in the queue as well.
We have lots of MapReduce jobs that run custom user code connecting to various
services outside of the cluster or performing other custom, non-filesystem
processing. Having framework-enforced task timeouts is critical to prevent
network or user-code errors in these tasks from hanging the entire job and
clogging up the cluster with wasted resources until someone manually comes
along and cleans up. Otherwise we're at the mercy of the user code to play
nice, and we have to hope a task never hangs in a way where it can still
heartbeat to the AM but is effectively dead.
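To make the kind of framework-enforced check we're after concrete, here is a
minimal sketch (not the actual Tez implementation; the class and method names
are hypothetical) of an AM-side watchdog that treats only a change in reported
progress as liveness, so an attempt that keeps heartbeating without progressing
is killed after a configurable timeout:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical AM-side watchdog: a heartbeat alone does not count as
 * progress; only a change in the reported progress value resets the timer.
 */
public class TaskProgressWatchdog {

  /** Last progress value and the wall-clock time it last changed. */
  private static final class ProgressRecord {
    float lastProgress = -1f;
    long lastChangeMillis = System.currentTimeMillis();
  }

  /** Callback into the AM/scheduler that actually kills a stuck attempt. */
  public interface AttemptKiller {
    void killAttempt(String attemptId, String diagnostics);
  }

  private final Map<String, ProgressRecord> attempts = new ConcurrentHashMap<>();
  private final long stuckTimeoutMillis;
  private final AttemptKiller killer;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public TaskProgressWatchdog(long stuckTimeoutMillis, AttemptKiller killer) {
    this.stuckTimeoutMillis = stuckTimeoutMillis;
    this.killer = killer;
  }

  /** Called from the heartbeat path with the attempt's reported progress. */
  public void onHeartbeat(String attemptId, float progress) {
    ProgressRecord rec = attempts.computeIfAbsent(attemptId, id -> new ProgressRecord());
    if (progress > rec.lastProgress) {
      rec.lastProgress = progress;
      rec.lastChangeMillis = System.currentTimeMillis();
    }
    // A heartbeat with unchanged progress does not reset the timer.
  }

  /** Stop tracking attempts that finished or were released. */
  public void onAttemptFinished(String attemptId) {
    attempts.remove(attemptId);
  }

  /** Periodically scan for attempts whose progress has not moved in too long. */
  public void start() {
    long scanPeriod = Math.max(1, stuckTimeoutMillis / 2);
    scheduler.scheduleWithFixedDelay(() -> {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, ProgressRecord> e : attempts.entrySet()) {
        if (now - e.getValue().lastChangeMillis > stuckTimeoutMillis) {
          killer.killAttempt(e.getKey(),
              "No progress reported for " + stuckTimeoutMillis + " ms; killing attempt");
          attempts.remove(e.getKey());
        }
      }
    }, scanPeriod, scanPeriod, TimeUnit.MILLISECONDS);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}
The key design point is that a heartbeat alone never resets the timer; only
forward movement of the progress value does, which is exactly what separates
the hung streaming tasks described above from slow-but-live ones.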
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)