[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957655#comment-14957655
 ] 

Bikas Saha commented on TEZ-808:
--------------------------------

In theory, IPOs are independent entities, so they don't understand the
specifics of each other's implementation. Hence the processor has its own
setProgress(float), where the argument is supposed to report just the
processor's progress. It does not have to report aggregate progress across
inputs and outputs as well. So PigProcessor can report progress based on its
internal processing. Overall progress is not based on the input and processor
alone: after the processor is done, the output may still run for a long time
and need to report progress in some cases (e.g. a large sort). So we do need
to keep them separate. Once we add a similar setProgress API for the IOs, they
can report their progress too. How we combine them is a presentation-layer
concern (while here we are focusing on fault tolerance).

Here is what we should do:
1) Enable tracking of existing setProgress(float) calls so that we know
progress is not being made if setProgress has not been invoked for a while.
The AM tracks progress updates and flags a task as hung if no progress update
happens for a configurable time.
2) Add similar setProgress(float) calls for IOs so that they can also report
progress being made (with the same tracking as in 1). Internal IOs are to be
updated to call setProgress() at different processing points. The logic added
in 1 will take care of combined IOP progress tracking. This will take care of
the 90% scenario.
The work is large enough to split into 2 jiras along the same lines. Any
further suggestions/additions?
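
To make the proposal concrete, here is a rough sketch of the tracking in 1. It is not Tez code: ProgressTracker, attemptStarted, onSetProgress, attemptFinished and findHungAttempts are all hypothetical names, and the caller passes in the current time so the logic is easy to test. It assumes that any setProgress call from an input, processor or output of an attempt resets that attempt's timer, which is how 1 would also cover the combined IOP case in 2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of Tez: every name here is invented for illustration.
class ProgressTracker {
    // Immutable snapshot of the most recent progress report for one task attempt.
    private static final class Report {
        final float progress;
        final long atMillis;

        Report(float progress, long atMillis) {
            this.progress = progress;
            this.atMillis = atMillis;
        }
    }

    private final long timeoutMillis; // the "configurable time" from the proposal
    private final Map<String, Report> lastReport = new ConcurrentHashMap<>();

    ProgressTracker(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    // Start the clock when the attempt launches, so an attempt that never
    // reports anything is still detected.
    void attemptStarted(String attemptId, long nowMillis) {
        lastReport.put(attemptId, new Report(0f, nowMillis));
    }

    // Called for every setProgress(float) from any input, processor or output
    // of the attempt. Reports for unknown or finished attempts are ignored.
    void onSetProgress(String attemptId, float progress, long nowMillis) {
        lastReport.computeIfPresent(attemptId, (id, old) -> new Report(progress, nowMillis));
    }

    // Stop tracking an attempt that completed, failed or was killed.
    void attemptFinished(String attemptId) {
        lastReport.remove(attemptId);
    }

    // Attempts whose last report is older than the timeout are flagged as hung.
    List<String> findHungAttempts(long nowMillis) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Report> e : lastReport.entrySet()) {
            if (nowMillis - e.getValue().atMillis > timeoutMillis) {
                hung.add(e.getKey());
            }
        }
        return hung;
    }
}
```

In this sketch the AM would call findHungAttempts periodically and then decide what to do with each flagged attempt (kill and restart, or speculate). The reported value itself is only recorded, not interpreted, since what matters for fault tolerance is that a call happened at all.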


> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
