[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957408#comment-14957408
 ] 

Bikas Saha commented on TEZ-808:
--------------------------------

bq. Is that sufficient if we're only updating IO stats progress when it's 
closed? And will we have issues during shuffle where we haven't started 
processing yet?
As I said in item 1) above, we can add finer-grained updates of processed items 
from the IOs as an indicator of progress, or a new reportProgress() API. 
A new progress API for IOs would help alleviate the issues that you mention. 
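
To make the idea concrete, here is a rough sketch of what such an IO-side 
progress hook might look like. Nothing here is existing Tez code: the class 
IoProgressTracker and all of its methods are hypothetical, and only the name 
reportProgress() comes from the proposal above. The point is just that an IO 
reports processed items as it goes, instead of only updating stats on close.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; none of these names are existing Tez APIs.
class IoProgressTracker {
    private final AtomicLong itemsProcessed = new AtomicLong();
    private final AtomicLong lastReportNanos;

    IoProgressTracker(long nowNanos) {
        this.lastReportNanos = new AtomicLong(nowNanos);
    }

    // Called by an Input/Output whenever it has handled some items.
    // Reports of zero or fewer items are ignored and do not count as progress.
    void reportProgress(long items, long nowNanos) {
        if (items > 0) {
            itemsProcessed.addAndGet(items);
            lastReportNanos.set(nowNanos);
        }
    }

    long getItemsProcessed() {
        return itemsProcessed.get();
    }

    // Time elapsed since the IO last reported any processed items.
    long nanosSinceLastReport(long nowNanos) {
        return nowNanos - lastReportNanos.get();
    }
}
```

The timestamps are passed in rather than read from the clock so that the 
framework decides the time source. During shuffle, the input could call 
reportProgress() per fetched chunk even though the processor has not started.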

Taking a step back, I think we agree on the issues and the various things that 
can be done to fix them. This is clearly an area where a bunch of heuristics 
come into play, and that is usually a grey area. So let's figure out what the 
next steps should be.

For the hung job that you saw, was the processor stuck or was it an IO?

1) If it was an IO then the next step would be to add the progress API on the 
IOs and enhance the internal framework IO code to make those progress calls. If 
the IO was actually hung then perhaps we also need a separate jira to fix the 
bug that made it hang.

2) If it was a processor then the next step would be to improve Tez Child code 
to track the number of invocations of processor.setProgress() so that we can 
figure out whether the processor is stuck because the user code has not 
called us back. We would also need to check the processor code in your 
environment to verify that it calls setProgress() correctly.

3) Assume the processors cannot be depended on to be well behaved, so only do 
1) and use a lack of progress from IOs as a proxy for the processor not doing 
anything. IMO, this by itself would not always work without 2), since a 
processor could read all the input and then spend a lot of time crunching 
through it before writing anything out. During that pure processing time we 
could wrongly flag the task as hung because the IOs are not making progress.
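
The way 1) and 2) complement each other could be sketched roughly as below. 
This is only an illustration under assumptions: TaskProgressMonitor, its 
fields, and the timeout handling are all hypothetical and not Tez Child code. 
It treats the task as stuck only when neither the IO item count nor the 
setProgress() invocation count has advanced within a timeout.

```java
// Hypothetical sketch; class and method names are illustrative, not Tez code.
class TaskProgressMonitor {
    private final long timeoutMillis;
    private long lastIoItems;
    private long lastSetProgressCalls;
    private long lastActivityMillis;

    TaskProgressMonitor(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivityMillis = nowMillis;
    }

    // ioItems: total items reported by the IOs so far (item 1).
    // setProgressCalls: total processor.setProgress() invocations (item 2).
    // Returns true only if neither counter advanced for timeoutMillis.
    boolean isStuck(long ioItems, long setProgressCalls, long nowMillis) {
        if (ioItems > lastIoItems || setProgressCalls > lastSetProgressCalls) {
            lastIoItems = ioItems;
            lastSetProgressCalls = setProgressCalls;
            lastActivityMillis = nowMillis;
            return false;
        }
        return nowMillis - lastActivityMillis >= timeoutMillis;
    }
}
```

With only the IO counter, a processor that has read all its input and is 
still crunching would time out. Counting setProgress() calls keeps such a 
task alive as long as the user code keeps calling back.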

If the above does not capture your scenario, then perhaps you could describe 
the scenario and propose some solutions. Let's get to a set of concrete action 
items :)

> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
