[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957151#comment-14957151
]
Jason Lowe commented on TEZ-808:
--------------------------------
bq. If the task was not using incremental units of cpu and disk then it would
be flagged as stuck despite progress updates. This would catch both deadlock
(no cpu) and livelocks (spinning cpu).
Because Tez uses a separate thread for heartbeating to the AM, it will hardly
ever deadlock completely -- there will always be that thread waking up to ping
the AM and generate CPU activity. And the problem with watching the disk is
that it assumes the task will consume local disk as part of its processing.
Most jobs will, but I think it is possible to create inputs and outputs that
wouldn't.
bq. Add logic in TezChild to track progress based on stats progress by IOs and
the number of invocations of processorContext.setProgress(). Send this
information to the AM which would terminate tasks that make no indications of
progress for a configurable period of time (this jira)
Is that sufficient if we're only updating IO stats progress when it's closed?
And will we have issues during shuffle where we haven't started processing yet,
waiting for the last upstream task to complete? My main concern is that this
could start killing legitimate tasks left and right in practice, because the
framework isn't reporting progress often enough and nobody has updated their
custom implementations to do so.
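For illustration, the AM-side check being discussed might look roughly like the
sketch below. It is not Tez code: ProgressWatchdog, its method names, and the
idea of folding the IO counters and the setProgress() call count into a single
signal are all assumptions made for the example. The point it shows is that a
heartbeat arriving does not by itself count as progress; only a change in the
reported signal resets the timer. A non-positive timeout turns the check off,
which is one hypothetical way to expose a switch for disabling it.

```java
// Hypothetical sketch of a no-progress timeout; none of these names are Tez APIs.
final class ProgressWatchdog {
    private final long timeoutMillis;  // assumption: <= 0 disables the check
    private long lastSignal;           // last progress signal seen in a heartbeat
    private long lastChangeMillis;     // time at which the signal last changed

    ProgressWatchdog(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastSignal = 0L;
        this.lastChangeMillis = startMillis;
    }

    /**
     * Called for every heartbeat. Both inputs are assumed to be cumulative and
     * non-decreasing, so their sum changes exactly when either one changes.
     */
    void onHeartbeat(long ioCounterTotal, long setProgressCalls, long nowMillis) {
        long signal = ioCounterTotal + setProgressCalls;
        if (signal != lastSignal) {
            lastSignal = signal;
            lastChangeMillis = nowMillis;  // only real progress resets the timer
        }
    }

    /** True when the check is enabled and the signal has been flat for the full timeout. */
    boolean isStuck(long nowMillis) {
        return timeoutMillis > 0 && nowMillis - lastChangeMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        ProgressWatchdog w = new ProgressWatchdog(1000L, 0L);
        w.onHeartbeat(5L, 1L, 100L);   // progress observed at t=100
        w.onHeartbeat(5L, 1L, 600L);   // heartbeat arrives, but nothing changed
        System.out.println("stuck at t=1100: " + w.isStuck(1100L));
    }
}
```

With this shape, the concern above becomes visible: a task waiting in shuffle
keeps heartbeating with a flat signal and would be flagged once the timeout
elapses, unless the framework IOs bump their counters while waiting.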
In practice it's rare for user-provided map or reduce methods to need explicit
progress reporting because the framework-supplied progress reporting covers it
in the vast majority of cases. Similarly, I'm hoping that if we can get the
framework-provided IOs to behave properly with timely progress reporting then
that would also cover most scenarios. We'd need to track down and fix custom
IOs that are not reporting timely progress, but we can always disable the task
timeout on a per-job basis until those IOs are fixed. And just like MapReduce,
we can track down and fix the problematic processors that chew on an input for
a long time before consuming more, so that they explicitly report progress
while they do so.
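As a rough sketch of what such a fix could look like in a processor: the work
on a single input is broken into steps, and progress is reported every so many
steps rather than only when the next input is consumed. ProgressSink is a
hypothetical stand-in for whatever progress hook the framework exposes (for
example processorContext.setProgress()), and LongRunningStep and its parameters
are made up for the example.

```java
// Hypothetical stand-in for the framework's progress hook; not a Tez interface.
interface ProgressSink {
    void setProgress(float progress);
}

final class LongRunningStep {
    /**
     * Runs totalSteps units of work on one input and reports progress every
     * reportEvery steps, plus once at the end. Returns the number of reports sent.
     */
    static int crunch(int totalSteps, int reportEvery, ProgressSink sink) {
        if (reportEvery <= 0) {
            throw new IllegalArgumentException("reportEvery must be positive");
        }
        int reports = 0;
        for (int step = 1; step <= totalSteps; step++) {
            // ... one unit of expensive work on the current input would go here ...
            if (step % reportEvery == 0 || step == totalSteps) {
                sink.setProgress((float) step / totalSteps);
                reports++;
            }
        }
        return reports;
    }

    public static void main(String[] args) {
        int sent = crunch(10, 4, p -> System.out.println("progress " + p));
        System.out.println("reports sent: " + sent);
    }
}
```

A real processor would more likely report on a time interval than on a step
count; the step count is used here only to keep the example deterministic.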
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.