[ https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955389#comment-14955389 ]
Jason Lowe commented on TEZ-808:
--------------------------------
I just ran into the lack of this on some Tez jobs that hung forever. Tasks
were stuck and not making progress, but the heartbeat handler thread kept
pinging the AM. This is a significant regression from MapReduce, since it can
manifest as a job that hangs forever and has to be killed manually. At a
minimum, the heartbeat should carry some kind of status indicating whether
inputs have been consumed and/or outputs have been generated since the last
status. That way we can flag tasks that stop processing, and the AM can kill
them after a configurable timeout akin to mapreduce.task.timeout in
MapReduce.
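
To make the proposal concrete, here is a rough sketch of how the AM side
might track this. It is not existing Tez code: the class name, method names,
and the boolean progress flag on the heartbeat are all hypothetical, and the
timeout value would come from whatever configuration key ends up being
chosen. Timestamps are passed in explicitly so the logic is easy to test.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch (not a Tez API): remembers when each task attempt
 * last reported real progress and flags attempts that exceed a timeout.
 */
class ProgressTimeoutMonitor {
    private final long timeoutMillis;
    // attempt ID -> time (ms) of the last heartbeat that reported progress
    private final Map<String, Long> lastProgressMillis = new HashMap<>();

    ProgressTimeoutMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Called for every heartbeat. The flag is assumed to mean "inputs were
     * consumed and/or outputs were generated since the last status".
     */
    synchronized void onHeartbeat(String attemptId, boolean madeProgress, long nowMillis) {
        // The first heartbeat starts the clock; afterwards only heartbeats
        // that report progress reset it. A bare ping does not count.
        if (madeProgress || !lastProgressMillis.containsKey(attemptId)) {
            lastProgressMillis.put(attemptId, nowMillis);
        }
    }

    /** Called when an attempt finishes or is killed. */
    synchronized void remove(String attemptId) {
        lastProgressMillis.remove(attemptId);
    }

    /** Returns the attempts with no reported progress for longer than the timeout. */
    synchronized List<String> findStuckAttempts(long nowMillis) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastProgressMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

The AM would periodically call findStuckAttempts() and kill whatever it
returns. The point of the sketch is that liveness (the heartbeat arriving)
and progress (the flag being set) are tracked separately, so a task whose
heartbeat thread is still pinging but whose processor is wedged would still
time out.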
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)