[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955389#comment-14955389
 ] 

Jason Lowe commented on TEZ-808:
--------------------------------

Just ran into the lack of this with some Tez jobs that hung forever.  Tasks 
were stuck and not making progress, but the heartbeat handler thread kept 
pinging the AM, so the AM still saw them as alive.  This is a significant 
regression from MapReduce, since it can manifest as a job that hangs forever 
and has to be killed manually.  At a minimum, the heartbeat should carry some 
kind of status indicating whether inputs have been consumed and/or outputs 
generated since the last status.  That way we can flag tasks that stop 
processing, and the AM can kill them after a configurable timeout, akin to 
mapreduce.task.timeout in MapReduce.
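
As a rough sketch of the idea, and not existing Tez code: the AM could keep a 
per-attempt timestamp of the last heartbeat that reported progress, and 
periodically look for attempts that have exceeded the timeout.  The class 
name, method names, and the boolean progress flag below are all hypothetical 
and only illustrate the bookkeeping.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical AM-side tracker; the names here are not part of the Tez API. */
class TaskProgressMonitor {
    // Assumed to play the role of mapreduce.task.timeout.
    private final long timeoutMillis;
    // Last time each attempt reported progress (or was first seen).
    private final Map<String, Long> lastProgressMillis = new HashMap<>();

    TaskProgressMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Record a heartbeat. madeProgress is true if the task consumed inputs
     * or generated outputs since its previous status.
     */
    synchronized void heartbeat(String attemptId, boolean madeProgress, long nowMillis) {
        // A heartbeat without progress does not reset the clock.
        if (madeProgress || !lastProgressMillis.containsKey(attemptId)) {
            lastProgressMillis.put(attemptId, nowMillis);
        }
    }

    /** Return attempts whose last progress is older than the timeout. */
    synchronized List<String> findStuckAttempts(long nowMillis) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastProgressMillis.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                stuck.add(e.getKey());
            }
        }
        return stuck;
    }

    /** Stop tracking an attempt once it finishes or is killed. */
    synchronized void remove(String attemptId) {
        lastProgressMillis.remove(attemptId);
    }
}
```

The point of the sketch is that liveness (a heartbeat arrived) and progress 
(work was done) are tracked separately, so a task that keeps pinging but does 
no work still times out.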

> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



