[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955483#comment-14955483
]
Jason Lowe commented on TEZ-808:
--------------------------------
Correct, in this latest case the tasks were part of Pig streaming jobs and
somehow lost track of their subprocess. The subprocess had exited long ago,
but the task was still waiting for a next tuple that would never arrive.
Something must be broken with Pig streaming there, but Tez should not
have allowed the tasks to hang indefinitely. Each time a task hung, there were
many downstream tasks waiting for it to complete, so the overall cluster
footprint of these hung jobs was significant. In some cases it was enough to
clog up the queue, preventing not only this job from completing but all others
behind it in the queue as well.
We have lots of MapReduce jobs that run custom user code connecting to various
services outside of the cluster or performing other custom, non-filesystem
processing. Having framework-enforced task timeouts is critical to prevent
network or user-code errors in these tasks from hanging the entire job and
clogging up the cluster with wasted resources until someone manually comes
along and cleans up. Otherwise we're at the mercy of the user code to play
nice, and we have to hope a task never hangs in a way where it can still
heartbeat to the AM but is effectively dead.
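To make the kind of framework-enforced check we're after concrete, here is a
minimal sketch (not the actual Tez implementation; the class and method names
are hypothetical) of an AM-side watchdog that treats only a change in reported
progress as liveness, so an attempt that keeps heartbeating without progressing
is killed after a configurable timeout:
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical AM-side watchdog: a heartbeat alone does not count as
 * progress; only a change in the reported progress value resets the timer.
 */
public class TaskProgressWatchdog {

  /** Last progress value and the wall-clock time it last changed. */
  private static final class ProgressRecord {
    float lastProgress = -1f;
    long lastChangeMillis = System.currentTimeMillis();
  }

  /** Callback into the AM/scheduler that actually kills a stuck attempt. */
  public interface AttemptKiller {
    void killAttempt(String attemptId, String diagnostics);
  }

  private final Map<String, ProgressRecord> attempts = new ConcurrentHashMap<>();
  private final long stuckTimeoutMillis;
  private final AttemptKiller killer;
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  public TaskProgressWatchdog(long stuckTimeoutMillis, AttemptKiller killer) {
    this.stuckTimeoutMillis = stuckTimeoutMillis;
    this.killer = killer;
  }

  /** Called from the heartbeat path with the attempt's reported progress. */
  public void onHeartbeat(String attemptId, float progress) {
    ProgressRecord rec = attempts.computeIfAbsent(attemptId, id -> new ProgressRecord());
    if (progress > rec.lastProgress) {
      rec.lastProgress = progress;
      rec.lastChangeMillis = System.currentTimeMillis();
    }
    // A heartbeat with unchanged progress does not reset the timer.
  }

  /** Stop tracking attempts that finished or were released. */
  public void onAttemptFinished(String attemptId) {
    attempts.remove(attemptId);
  }

  /** Periodically scan for attempts whose progress has not moved in too long. */
  public void start() {
    long scanPeriod = Math.max(1, stuckTimeoutMillis / 2);
    scheduler.scheduleWithFixedDelay(() -> {
      long now = System.currentTimeMillis();
      for (Map.Entry<String, ProgressRecord> e : attempts.entrySet()) {
        if (now - e.getValue().lastChangeMillis > stuckTimeoutMillis) {
          killer.killAttempt(e.getKey(),
              "No progress reported for " + stuckTimeoutMillis + " ms; killing attempt");
          attempts.remove(e.getKey());
        }
      }
    }, scanPeriod, scanPeriod, TimeUnit.MILLISECONDS);
  }

  public void stop() {
    scheduler.shutdownNow();
  }
}
{code}
The key design point is that a heartbeat alone never resets the timer; only
forward movement of the progress value does, which is exactly what separates
the hung streaming tasks described above from slow-but-live ones.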
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)