[ https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974432#comment-14974432 ]
Jason Lowe commented on TEZ-808:
--------------------------------
Thanks for the patch, Bikas! I haven't had a chance to look at it in great
detail, but here are a few comments from an initial glance at the patch:
IMHO notifyProgress needs to be made as cheap as possible, and the javadoc
implies that it is expensive. Users (and even the IO framework code) will want
to call this after each input record processed, for example. If possible, we
shouldn't burden them with having to go out of their way to avoid calling it
too often. I'm also a little confused as to why we are tracking how many times
it's called versus whether it's been called at all. Counting calls requires a
read-modify-write and therefore could have performance issues with concurrency.
I think we could make this very cheap by just having a volatile flag that
indicates progress has been signalled and letting the separate task heartbeating
thread report and clear that flag when it heartbeats to the AM. Then the call
is very cheap, just a boolean write, and can be called just about anywhere in
the code without a significant performance impact. Yes, there would be a
slight race where we could miss a (redundant) progress update just as we read
and clear the flag, but missing that update is not going to be critical to
whether a task fails in practice.
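To make the suggestion concrete, here is a rough sketch of the flag approach. The class name ProgressFlag and the method getAndClearProgress are made up for illustration and are not taken from the patch; only notifyProgress comes from the discussion above.

```java
// Hypothetical sketch: ProgressFlag and getAndClearProgress are illustrative
// names, not existing Tez APIs.
final class ProgressFlag {
    // volatile so a write from the task thread is visible to the heartbeat thread
    private volatile boolean progressNotified = false;

    /** Called by task or IO code, possibly once per record: one volatile write. */
    void notifyProgress() {
        progressNotified = true;
    }

    /**
     * Called only by the heartbeat thread. Returns whether progress was
     * signalled since the last call, then resets the flag.
     */
    boolean getAndClearProgress() {
        boolean seen = progressNotified;
        if (seen) {
            // A notifyProgress() landing between the read above and this write
            // is lost. That update is redundant with the one being reported,
            // so the race is benign.
            progressNotified = false;
        }
        return seen;
    }
}
```

The heartbeat thread would call getAndClearProgress() once per heartbeat and include the result in what it sends to the AM. No atomic counter or lock is needed on the caller's side.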
Nit: A timeout of zero should also be treated as no timeout being set. It
makes no sense to run it with a value of zero.
Nit: Do we want the property to be in milliseconds? I can't see any user ever
wanting this property to have sub-second granularity, and therefore
representing it in milliseconds seems like we're just making it harder to set.
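For the two nits above, something along these lines is what I have in mind. The class, method, and constant names are hypothetical, and the sketch assumes the property is expressed in whole seconds with any value of zero or less meaning "no timeout".

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: names are illustrative, not taken from the patch.
final class ProgressTimeout {
    // Sentinel meaning "do not monitor for lack of progress"
    static final long DISABLED = -1L;

    /** Converts the configured value (assumed to be in seconds) to millis. */
    static long toTimeoutMillis(long configuredSecs) {
        // Zero and negative values are both treated as "no timeout set"
        if (configuredSecs <= 0) {
            return DISABLED;
        }
        return TimeUnit.SECONDS.toMillis(configuredSecs);
    }

    /** True only when a timeout is set and it has elapsed since last progress. */
    static boolean isHung(long timeoutMillis, long lastProgressMillis, long nowMillis) {
        return timeoutMillis != DISABLED
            && nowMillis - lastProgressMillis > timeoutMillis;
    }
}
```

Internally the check can still work in milliseconds; only the user-facing property changes units.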
Also, I am assuming the followup JIRA that has the framework notify of progress
will also change the default value for the hung progress property so that the
AM monitors for lack of progress by default.
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.