[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974432#comment-14974432
 ] 

Jason Lowe commented on TEZ-808:
--------------------------------

Thanks for the patch, Bikas!  I haven't had a chance to look at it in great 
detail, but here are a few comments from an initial glance at the patch:

IMHO notifyProgress needs to be made as cheap as possible, and the javadoc 
implies that it is expensive.  Users (and even the IO framework code) will want 
to call this after each input record processed, for example.  If possible, we 
shouldn't burden them with having to go out of their way to avoid calling it 
too often.  I'm also a little confused as to why we are tracking how many times 
it's called versus whether it's been called at all.  Doing so requires a 
read-modify-write and therefore could have performance issues with concurrency.  
I think we could make this very cheap by just having a volatile flag that 
indicates progress has been signalled and let the separate task heartbeating 
thread report and clear that flag when it heartbeats to the AM.  Then the call 
is very cheap, just a boolean write, and can be called just about anywhere in 
the code without a significant performance impact.  Yes, there would be a 
slight race where we could miss a (redundant) progress update just as we read 
and clear the flag, but missing that update is not going to be critical to 
whether a task fails in practice.
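
A minimal sketch of the volatile-flag idea described above, for illustration 
only.  The class and method names (ProgressFlag, getAndClearProgress) are 
hypothetical and are not taken from the patch; the only assumption is that one 
or more task threads signal progress and a single heartbeat thread reads it.

```java
/** Hypothetical sketch; names are illustrative, not from TEZ-808.1.patch. */
final class ProgressFlag {
    // volatile so a write by a task thread is visible to the heartbeat thread.
    private volatile boolean progressNotified = false;

    /** Called by task code, possibly once per record: a single volatile write. */
    void notifyProgress() {
        progressNotified = true;
    }

    /** Called only by the heartbeat thread just before it reports to the AM. */
    boolean getAndClearProgress() {
        boolean seen = progressNotified;
        if (seen) {
            // A notifyProgress() call landing between the read above and this
            // write is lost, but progress is already being reported for this
            // heartbeat, so the lost update is redundant.
            progressNotified = false;
        }
        return seen;
    }
}
```

No counter and no atomic read-modify-write is needed on the hot path; the 
heartbeat thread does the only read-then-write, once per heartbeat.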

Nit: A timeout of zero should also be treated as no timeout being set.  It 
makes no sense to run it with a value of zero.
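
A small sketch of that check, under the assumption that a negative value 
already means "no timeout" (the patch may encode this differently).  The class 
and method names are hypothetical.

```java
/** Hypothetical sketch; names and the negative-means-disabled rule are assumptions. */
final class ProgressTimeout {
    /** Zero or negative is treated as "no timeout set". */
    static boolean isEnabled(long timeoutMillis) {
        return timeoutMillis > 0;
    }

    /** True only when monitoring is enabled and the attempt has been silent too long. */
    static boolean isHung(long timeoutMillis, long lastProgressMillis, long nowMillis) {
        return isEnabled(timeoutMillis)
            && nowMillis - lastProgressMillis > timeoutMillis;
    }
}
```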

Nit: Do we want the property to be in milliseconds?  I can't see any user ever 
wanting this property to have sub-second granularity, and therefore 
representing it in milliseconds seems like we're just making it harder to set.  
Also I am assuming the followup JIRA to have the framework notify of progress 
will also change the default value for the hung progress property so the AM is 
monitoring for lack of progress by default.


> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
