[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976420#comment-14976420
]
Bikas Saha commented on TEZ-808:
--------------------------------
I had thought of some use for the number of invocations, which I just cannot
recall right now :(. If I can't recall it, I am fine with changing it to a
boolean. However, even volatile is not going to be great for perf because
every volatile write pays a memory-barrier cost to ensure the volatile
visibility guarantee, and doing that in a tight inner loop would not be great.
The intent of the javadoc was to advise users not to call this in a tight
record-processing inner loop since (perf issues aside) calling from user code
into framework code results in method invocations that may or may not be
inlined by the JVM.
If we use a boolean, then I think it will be fine to not use volatile since we
are not looking at fine-grained inter-thread collisions. The heartbeat thread
reads this quite infrequently, and a true value should be visible to it by the
time it executes. Thoughts?
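
To make the trade-off concrete, here is a minimal sketch of the flag pattern
being discussed. The names (ProgressFlag, notifyProgress, getAndClear) are
hypothetical and are not the Tez API or the contents of the patch. As an
assumption of this sketch, it uses AtomicBoolean.lazySet, an ordered write that
is typically cheaper than a full volatile store; a plain non-volatile field
would avoid even that cost but has no formal visibility guarantee under the
Java Memory Model.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; not part of the Tez API.
final class ProgressFlag {
    private final AtomicBoolean progressed = new AtomicBoolean(false);

    // Called by the task (processor) thread. lazySet is an ordered write
    // that skips the full fence a volatile store would need.
    void notifyProgress() {
        progressed.lazySet(true);
    }

    // Called by the heartbeat thread: reports whether progress was notified
    // since the previous call, and resets the flag.
    boolean getAndClear() {
        return progressed.getAndSet(false);
    }

    public static void main(String[] args) throws InterruptedException {
        ProgressFlag flag = new ProgressFlag();
        Thread task = new Thread(flag::notifyProgress);
        task.start();
        task.join(); // join() makes the task thread's write visible here
        System.out.println(flag.getAndClear()); // prints "true"
        System.out.println(flag.getAndClear()); // prints "false"
    }
}
```

Even with a cheap write, the call is still better made once per batch of
records than once per record, for the inlining reason given above.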
I chose milliseconds because in LLAP scenarios in Hive, stalls of 500ms might
be long enough to warrant similar actions.
I am not certain we can change the default to do this always, because there
are probably jobs out there in the wild where the processor spends a long
time doing work without reporting anything, e.g. expensive map joins. Unless
those processors implement regular progress notifications, we may end up
incorrectly failing tasks that are already expensive, which would be a
backwards-incompatible change. Right?
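
As a rough sketch of what an opt-in check with a millisecond timeout could
look like: the class name StallDetector, its methods, and the convention that
a timeout <= 0 disables the check are all assumptions made for illustration,
not what the patch does. With the check disabled by default, existing jobs
that never report progress are unaffected.

```java
// Hypothetical sketch; names and the "<= 0 disables" convention are assumptions.
final class StallDetector {
    private final long timeoutMillis;  // <= 0 means the check is disabled
    private long lastProgressMillis;

    StallDetector(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = nowMillis;
    }

    // Intended to be called from the heartbeat thread on each heartbeat.
    // Returns true only when the check is enabled and no progress has been
    // reported for at least timeoutMillis.
    boolean isStalled(boolean progressedSinceLastCheck, long nowMillis) {
        if (timeoutMillis <= 0) {
            return false; // opt-in: disabled unless a timeout is configured
        }
        if (progressedSinceLastCheck) {
            lastProgressMillis = nowMillis;
            return false;
        }
        return nowMillis - lastProgressMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        StallDetector d = new StallDetector(500, 0);
        System.out.println(d.isStalled(false, 499)); // prints "false"
        System.out.println(d.isStalled(false, 500)); // prints "true"
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic; a
real caller would supply something like System.currentTimeMillis().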
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.