[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957536#comment-14957536
 ] 

Jason Lowe commented on TEZ-808:
--------------------------------

bq. Fixing IOs vs Fixing processor callback - which one of these would benefit 
the case for the stack trace shown above?  I am not sure if the processor was 
waiting for a tuple from an IO or from somewhere else?

To be honest, I'm not sure either.  I'm not familiar with the internal workings 
of Pig streaming or how it interfaces with the IOs.  Rohini can comment further 
and is already tracking down, from the Pig side, what went wrong in this 
specific scenario.  However, I'm not that interested in fixing one very 
specific instance of a task hanging in this JIRA.  It's very likely a bug in 
the Pig streaming code that can be fixed separately.  For this JIRA I'm much 
more interested in getting Tez to handle the broad spectrum of ways tasks can 
hang.

That's why I think prioritizing IO progress reporting is key for this JIRA.  It 
will catch the cases where the task is not making IO progress, and almost all 
tasks normally have regular IO interaction.  So even if the processor is the 
problem and hangs, the lack of IO interaction will still flag the task as 
making no progress.  If the bug is in the IO layer and it hangs, there will 
also be a lack of IO interaction, as eventually everything will back up and 
stall on the hung IO.  Yes, there will be cases where periods of no IO progress 
should not be fatal to the task, but we can handle those with the processor 
progress API and, in the interim, work around them with the task timeout 
tunable.
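
A rough sketch of the idea, assuming a hypothetical ProgressWatchdog helper. 
None of these names are existing Tez APIs, and timestamps are passed in 
explicitly only to keep the example deterministic.  IOs and processors both 
refresh a shared timestamp, and a monitor treats the attempt as hung once the 
timeout elapses without a refresh:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper (not a Tez class): remembers when a task last showed progress. */
final class ProgressWatchdog {
    // Stand-in for the task timeout tunable, in milliseconds.
    private final long timeoutMillis;
    // Time of the most recent progress notification from any source.
    private final AtomicLong lastProgressMillis;

    ProgressWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = new AtomicLong(startMillis);
    }

    /** Assumed to be called by an input or output whenever it moves a record. */
    void notifyIoProgress(long nowMillis) {
        lastProgressMillis.set(nowMillis);
    }

    /** Assumed to be called by a processor that is legitimately busy without touching any IO. */
    void notifyProcessorProgress(long nowMillis) {
        lastProgressMillis.set(nowMillis);
    }

    /** True once more than timeoutMillis has passed since the last notification. */
    boolean isHung(long nowMillis) {
        return nowMillis - lastProgressMillis.get() > timeoutMillis;
    }
}
```

With this shape, a hang in either the processor or the IO layer eventually 
stops the notifyIoProgress calls, so one check covers both.  The processor 
callback exists only for tasks with long stretches of legitimate non-IO work.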

> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
