[jira] [Commented] (TEZ-808) Handle task attempts that are not making progress

Bikas Saha (JIRA) Tue, 13 Oct 2015 15:27:49 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14955823#comment-14955823
 ]


Bikas Saha commented on TEZ-808:
--------------------------------

I understand these nuances. The trouble is that all of the IPOs are black boxes 
for Tez and Tez child. So the framework does not have visibility into these 
details (or even KV pairs as a concept). IOs have access to an 
(Input|Output)StatisticsReporter which they can use to report number of items 
processed. So we have a potential solution there except that current impls for 
IOs only update that number upon close. So they are not useful today but can be 
updated periodically.
But the situation is different for a processor and the processor is the 
specific situation we are seeing and the generic potential problem in the 
cluster since most IO's tend to be framework code and processors tend to be 
user code. There is typically no framework code running in the processor - e.g. 
PigProcessor. Processors have a ProcessorContext.setProgress() method today to 
report progress and that is the progress reported by the heartbeat. You are 
right that the value of the progress field can vary widely, the fact that 
progress() method has been called can be tracked (which we don't do today). 
However, the onus is still on PigProcessor to call that method in a manner that 
reflects reality. E.g. if processors set up a thread to periodically call 
progress with the value from a shared variable then even if the actual 
processing is stuck, the progress() method would still end up being called 
repeatedly. In order to circumvent these situations, in a different project, we 
had implemented cpu and IO monitoring of the task. If the task was not using 
incremental units of cpu and disk then it would be flagged as stuck despite 
progress updates. This would catch both deadlock (no cpu) and livelocks 
(spinning cpu). And over and above all that, we had a max execution time that 
would be improbable to exceed for any normal tasks but which could be 
configured by users who know what they are doing.

In any case, unrolling the discussion back to the problem at hand. Here are 
various action items at hand.
1) Add finer grained updates of processed items from the IOs as an indicator of 
progress or a new reportProgress() API. This can be deferred for now.
2) Add logic in TezChild to track progress based on stats progress by IOs and 
the number of invocations of processorContext.setProgress(). Send this 
information to the AM which would terminate tasks that make no indications of 
progress for a configurable period of time (this jira)
3) If needed, PigProcessor should ensure that processorContext.setProgress() is 
being invoked by it. And that it is being invoked in a manner that will not 
hide hung code because it would be dependent on real work being done.
4) Potentially add a safety threshold on how long a task should run before its 
considered bad. This can be deferred for now.

Does that sound reasonable or are there further comments?

> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (TEZ-808) Handle task attempts that are not making progress

Reply via email to