[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957536#comment-14957536
]
Jason Lowe commented on TEZ-808:
--------------------------------
bq. Fixing IOs vs Fixing processor callback - which one of these would benefit
the case for the stack trace shown above? I am not sure if the processor was
waiting for a tuple from an IO or from somewhere else?
To be honest, I'm not sure either. I'm not familiar with the internal workings
of Pig streaming or how it interfaces with the IOs. Rohini can comment more
here and is already tracking down, from the Pig side, what went wrong in this
specific scenario. However, I'm not that interested in using this JIRA to fix
one very specific instance of a task hanging. It's very likely a bug in the
Pig streaming code that can be fixed separately. For this JIRA, I'm much more
interested in getting Tez to handle a broad spectrum of ways tasks can hang.
That's why I think prioritizing IO progress reporting is key for this JIRA. It
will catch the cases where the task is not making IO progress, and almost all
tasks normally have regular IO interaction. So even if the processor is the
problem and hangs, the resulting lack of IO interaction would still flag the
task as making no progress. If the bug is in the IO layer and it hangs, there
will also be a lack of IO interaction, as eventually everything will back up
and stall on the hung IO. Yes, there will be cases where periods of no IO
progress should not be fatal to the task, but we can handle those with the
processor progress API and, in the interim, work around them with the task
timeout tunable.
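
To make the reasoning above concrete, here is a minimal sketch of how IO
activity and explicit processor progress could feed a single stall check. It
is an illustration only, not Tez code: the class ProgressWatchdog, its method
names, and the idea of passing the clock value in as a parameter are all
hypothetical, and the timeout argument merely stands in for the task timeout
tunable mentioned above.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical sketch, not part of the Tez API: tracks the most recent
 * moment at which a task showed any sign of progress.
 */
final class ProgressWatchdog {
    private final long timeoutMillis;
    private final AtomicLong lastProgressMillis;

    ProgressWatchdog(long timeoutMillis, long startMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = new AtomicLong(startMillis);
    }

    /** Assumed to be called by the IO layer whenever a record is read or written. */
    void onIoProgress(long nowMillis) {
        touch(nowMillis);
    }

    /** Assumed to be called by a processor doing long work with no IO. */
    void onProcessorProgress(long nowMillis) {
        touch(nowMillis);
    }

    /** True if neither source has reported progress within the timeout. */
    boolean isStalled(long nowMillis) {
        return nowMillis - lastProgressMillis.get() > timeoutMillis;
    }

    // Keep the latest timestamp even if reports arrive out of order.
    private void touch(long nowMillis) {
        lastProgressMillis.accumulateAndGet(nowMillis, Math::max);
    }
}
```

A monitor thread could poll isStalled(System.currentTimeMillis()) and fail the
attempt once it returns true. A hung processor and a hung IO both end up
silencing onIoProgress, so one check covers both, while onProcessorProgress
gives legitimately IO-idle tasks a way to stay alive.
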
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.