[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957408#comment-14957408
]
Bikas Saha commented on TEZ-808:
--------------------------------
bq. Is that sufficient if we're only updating IO stats progress when it's
closed? And will we have issues during shuffle where we haven't started
processing yet
As I said in item 1) above: add finer-grained updates of processed items
from the IOs as an indicator of progress, or add a new reportProgress() API.
A new progress API for IOs would help alleviate the issues that you mention.
Taking a step back, I think we agree on the issues and various things that can
be done to fix them. This is clearly an area where a bunch of heuristics come
into play and that is usually a grey area. So let's figure out what the next
steps should be.
For the hung job that you saw, was the processor stuck or was it an IO?
1) If it was an IO then the next step would be to add the progress API on the
IOs and enhance the internal framework IO code to make those progress calls. If
the IO was actually hung then perhaps we also need a separate jira to fix the
bug that made it hang.
2) If it was a processor then the next step would be to improve Tez Child code
to track the number of invocations of processor.setProgress() so that we can
figure out if the processor is stuck or not because the user code has not
called us back. We would also need to check the processor code in your
environment to verify that it calls setProgress correctly.
3) Assume the processors cannot be depended on to be well behaved - so only do
1 and use a lack of progress from IOs as a proxy for the processor not doing
anything. IMO, this by itself would not always work without 2), since a
processor could read all the input and then spend a lot of time crunching
through it before writing anything out. During that pure processing time we
could flag the task as hung because the IOs are not making progress.
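To make the interplay between 1) and 2) concrete, here is a minimal sketch in
Java. It is only an illustration under stated assumptions:
AttemptProgressTracker, onIoItemsProcessed(), onProcessorSetProgress() and
isHung() are hypothetical names, not existing Tez classes or APIs. The idea is
that the attempt counts as alive if either the IOs reported processed items or
the processor called setProgress() since the last check, so a processor that
is crunching data without touching its IOs is not flagged as long as it keeps
calling back.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration only; not part of the Tez API.
final class AttemptProgressTracker {
    private final long timeoutMillis;
    // Item 1): items processed as reported by the IOs.
    private final AtomicLong ioItemsProcessed = new AtomicLong();
    // Item 2): number of processor.setProgress() invocations observed.
    private final AtomicLong setProgressCalls = new AtomicLong();

    private long lastIoSeen;
    private long lastCallsSeen;
    private long lastActivityMillis;

    AttemptProgressTracker(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    // Would be called by framework IO code whenever items are read or written.
    void onIoItemsProcessed(long count) {
        ioItemsProcessed.addAndGet(count);
    }

    // Would be called each time user code invokes setProgress().
    void onProcessorSetProgress() {
        setProgressCalls.incrementAndGet();
    }

    // Returns true only if neither counter moved for at least timeoutMillis.
    synchronized boolean isHung(long nowMillis) {
        long io = ioItemsProcessed.get();
        long calls = setProgressCalls.get();
        if (io != lastIoSeen || calls != lastCallsSeen) {
            // Some activity since the previous check: reset the idle timer.
            lastIoSeen = io;
            lastCallsSeen = calls;
            lastActivityMillis = nowMillis;
            return false;
        }
        return nowMillis - lastActivityMillis >= timeoutMillis;
    }
}
```

The caller passes the clock value in explicitly so that the check is easy to
test. Note that this sketch treats the number of setProgress() calls, not the
reported progress value, as the liveness signal, which matches the invocation
counting proposed in 2).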
If the above does not capture your scenario, then perhaps you could describe
the scenario and propose some solutions. Let's get to a set of concrete action
items :)
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.