[ https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957151#comment-14957151 ]

Jason Lowe commented on TEZ-808:
--------------------------------

bq. If the task was not using incremental units of cpu and disk then it would 
be flagged as stuck despite progress updates. This would catch both deadlock 
(no cpu) and livelocks (spinning cpu).
Because Tez uses a separate thread for heartbeating to the AM, it will hardly 
ever deadlock completely -- there will always be that thread waking up to ping 
the AM and generate CPU activity.  And the problem with watching the disk is 
that it assumes the task will consume local disk as part of its processing.  
Most jobs will, but I think it is possible to create input and outputs that 
wouldn't.

bq. Add logic in TezChild to track progress based on stats progress by IOs and 
the number of invocations of processorContext.setProgress(). Send this 
information to the AM which would terminate tasks that make no indications of 
progress for a configurable period of time (this jira)
Is that sufficient if we only update an IO's stats progress when the IO is 
closed?  And will we have issues during shuffle, when a task hasn't started 
processing yet because it is waiting for the last upstream task to complete?  
The main concern I have is that this could start killing legitimate tasks left 
and right in practice because the framework isn't reporting progress often 
enough and nobody has updated their custom implementations to do so.
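
To make the quoted proposal concrete, here is a minimal task-side sketch. It 
assumes a hypothetical ProgressIndicationCounter class (not part of Tez) that 
processors and IOs bump whenever they report progress, and that the heartbeat 
thread polls. The heartbeat itself deliberately does not count as progress.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical task-side counter of progress indications.
 * This is an illustration only, not a Tez class.
 */
final class ProgressIndicationCounter {
    // Bumped by processor/IO threads, read by the heartbeat thread.
    private final AtomicLong indications = new AtomicLong();
    // Value of the counter as of the previous heartbeat.
    private long lastReported = 0;

    /** Call whenever a processor reports progress or an IO updates its stats. */
    public void recordIndication() {
        indications.incrementAndGet();
    }

    /**
     * Call from the heartbeat thread. Returns true only if at least one
     * indication arrived since the previous heartbeat.
     */
    public synchronized boolean progressedSinceLastHeartbeat() {
        long current = indications.get();
        boolean progressed = current != lastReported;
        lastReported = current;
        return progressed;
    }

    public static void main(String[] args) {
        ProgressIndicationCounter counter = new ProgressIndicationCounter();
        System.out.println(counter.progressedSinceLastHeartbeat()); // false
        counter.recordIndication();
        System.out.println(counter.progressedSinceLastHeartbeat()); // true
        System.out.println(counter.progressedSinceLastHeartbeat()); // false
    }
}
```

An IO that only calls recordIndication() when it is closed would look idle on 
every heartbeat until then, which is exactly the scenario questioned above.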

In practice it's rare for user-provided map or reduce methods to need explicit 
progress reporting because the framework-supplied progress reporting covers it 
in the vast majority of cases.  Similarly, I'm hoping that if we can get the 
framework-provided IOs to behave properly with timely progress reporting then 
that would also cover most scenarios.  We'd need to track down and fix custom 
IOs that are not reporting timely progress, but we can always disable the task 
timeout on a per-job basis until those IOs are fixed.  And just as with 
MapReduce, we can track down the problematic processors that chew on an input 
for a long time before consuming more, and fix them to report progress 
explicitly while they do so.
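
A rough sketch of the AM-side check discussed above, under stated assumptions: 
NoProgressMonitor and its methods are hypothetical names (not Tez APIs), each 
heartbeat is assumed to carry a flag saying whether the task made progress, and 
a timeout of zero or less is assumed to mean the check is disabled for the job.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical AM-side monitor that flags attempts with no progress
 * for longer than a configurable timeout. Illustration only.
 */
final class NoProgressMonitor {
    // Assumption: a value <= 0 disables the check (e.g. per-job opt-out).
    private final long timeoutMs;
    // Last time each attempt was known to have made progress.
    private final Map<String, Long> lastProgressMs = new HashMap<>();

    public NoProgressMonitor(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record a heartbeat; only heartbeats that report progress reset the clock. */
    public void onHeartbeat(String attemptId, boolean progressed, long nowMs) {
        if (progressed) {
            lastProgressMs.put(attemptId, nowMs);
        } else {
            // First sighting of an attempt starts its clock.
            lastProgressMs.putIfAbsent(attemptId, nowMs);
        }
    }

    /** True if the attempt has gone longer than the timeout without progress. */
    public boolean isStuck(String attemptId, long nowMs) {
        if (timeoutMs <= 0) {
            return false; // check disabled
        }
        Long last = lastProgressMs.get(attemptId);
        return last != null && nowMs - last > timeoutMs;
    }

    public static void main(String[] args) {
        NoProgressMonitor monitor = new NoProgressMonitor(1000);
        monitor.onHeartbeat("attempt_1", true, 0);
        monitor.onHeartbeat("attempt_1", false, 600);
        System.out.println(monitor.isStuck("attempt_1", 600));  // false
        monitor.onHeartbeat("attempt_1", false, 1200);
        System.out.println(monitor.isStuck("attempt_1", 1200)); // true

        NoProgressMonitor disabled = new NoProgressMonitor(0);
        disabled.onHeartbeat("attempt_1", false, 0);
        System.out.println(disabled.isStuck("attempt_1", 5000)); // false
    }
}
```

Note that heartbeats without progress never reset the clock, so a task whose 
heartbeat thread is alive but whose processor is wedged is still caught.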


> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
