[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14957655#comment-14957655
]
Bikas Saha commented on TEZ-808:
--------------------------------
In theory, IPOs (inputs, processors, outputs) are independent entities, so they
don't understand the specifics of each other's implementation. Hence the
processor has its own setProgress(float), whose argument is supposed to report
just the processor's progress. It does not have to report aggregate progress
across inputs and outputs as well. So PigProcessor can report progress based on
its internal processing. Overall progress is not based on just the
input/processor. After the processor is done, the output may also need a long
time to report progress in some cases (e.g. large sort). So we do need to keep
them separate. After we add a similar setProgress API for the IOs, the IOs can
report their progress too. How we combine them is a presentation layer artifact
(while here we are focusing on fault tolerance).
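To make the separation concrete, here is a minimal sketch of inputs, processor
and outputs each reporting only their own progress, with the combination left
to a presentation-layer helper. ComponentProgress, its method names and the
plain-average rule are all hypothetical illustrations, not existing Tez API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for per-component progress; not part of the Tez API. */
public class ComponentProgress {
    // Latest reported progress per component (an input, the processor, or an output).
    private final Map<String, Float> progress = new LinkedHashMap<>();

    /** Each component reports only its own progress, a value in [0, 1]. */
    public synchronized void setProgress(String component, float value) {
        if (!(value >= 0f && value <= 1f)) {
            throw new IllegalArgumentException("progress must be in [0, 1]: " + value);
        }
        progress.put(component, value);
    }

    /** Returns the last value reported by the component, or 0 if it never reported. */
    public synchronized float getProgress(String component) {
        return progress.getOrDefault(component, 0f);
    }

    /** One possible presentation-layer combination (an assumption): a plain average. */
    public synchronized float average() {
        if (progress.isEmpty()) {
            return 0f;
        }
        float sum = 0f;
        for (float v : progress.values()) {
            sum += v;
        }
        return sum / progress.size();
    }
}
```

With this shape a finished processor (1.0) and an output that is halfway
through a large sort (0.5) stay visible as two separate values, and any
weighting other than the average can be swapped in without touching the
reporters.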
Here is what we should do:
1) Enable tracking of existing setProgress(float) calls so that we know
progress is not being made if setProgress has not been invoked for a while. The
AM tracks progress updates and flags a task as hung if no progress update
happens for a configurable time.
2) Add similar setProgress(float) calls for IOs so that they can also report
progress being made (with tracking similar to 1). Internal IOs are to be updated
to call setProgress() at different processing points. The logic added in 1 will
take care of combined IPO progress tracking. This will take care of the 90%
scenario.
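The tracking in 1 could look roughly like the sketch below: the AM records the
time of each progress update per attempt and flags attempts whose last update
is older than a configurable timeout. ProgressWatchdog and its methods are
hypothetical names, and timestamps are passed in explicitly only to keep the
sketch easy to test; it is not the actual AM implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical AM-side tracker for attempts that stop reporting progress. */
public class ProgressWatchdog {
    private final long timeoutMillis;
    // Time of the most recent progress update seen for each task attempt.
    private final Map<String, Long> lastUpdateMillis = new HashMap<>();

    public ProgressWatchdog(long timeoutMillis) {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        this.timeoutMillis = timeoutMillis;
    }

    /** Called whenever any input, processor, or output of the attempt reports progress. */
    public synchronized void onProgress(String attemptId, long nowMillis) {
        lastUpdateMillis.put(attemptId, nowMillis);
    }

    /** Stops tracking an attempt, e.g. once it has completed or been killed. */
    public synchronized void remove(String attemptId) {
        lastUpdateMillis.remove(attemptId);
    }

    /** Returns the attempts with no progress update for longer than the timeout. */
    public synchronized List<String> findHung(long nowMillis) {
        List<String> hung = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastUpdateMillis.entrySet()) {
            if (nowMillis - e.getValue() > timeoutMillis) {
                hung.add(e.getKey());
            }
        }
        return hung;
    }
}
```

Because any component's update refreshes the same per-attempt timestamp, an
attempt whose processor has finished but whose output is still reporting is
not flagged, which is the combined tracking described in 2.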
The work is large enough to split into two JIRAs along the same lines. Any
further suggestions/additions?
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.