[ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14976420#comment-14976420
 ] 

Bikas Saha commented on TEZ-808:
--------------------------------

I had thought of some use for the number of invocations, which I just cannot 
recall right now :(. If I can't recall it, I am fine with changing it to a 
boolean. However, even volatile is not going to be great for perf because it 
would cause a main memory access to ensure the volatile visibility guarantee. 
And doing that in a tight inner loop would not be great. The intent of the 
javadoc was to advise users not to call this in some record processing tight 
inner loop since (perf issues aside) calling from user code into framework code 
would result in method invocations that may or may not be inlined by the JVM. 

If we use a boolean, then I think it will be fine to not use volatile since we 
are not looking at fine-grained inter-thread collisions. The heartbeat thread 
reads this quite infrequently and a true value should be visible to it when it 
executes. Thoughts?
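
To make the boolean idea concrete, here is a minimal sketch of a progress flag that a processor sets and a heartbeat thread polls. The class and method names are hypothetical and are not taken from the Tez code base or the attached patch. The sketch keeps the field volatile as an assumption, because the Java memory model only formally guarantees cross-thread visibility with volatile (or other synchronization); dropping it, as suggested above, would trade that guarantee for cheaper writes.

```java
// Hypothetical sketch: names are illustrative and are not Tez APIs.
final class ProgressFlag {
    // Assumption: volatile is kept here for a formal visibility guarantee.
    // The comment above proposes dropping it because the reader polls rarely.
    private volatile boolean progressMade = false;

    /** Called by the processor thread, ideally not from a per-record inner loop. */
    void notifyProgress() {
        progressMade = true;
    }

    /**
     * Called by the heartbeat thread on each heartbeat.
     * Returns true if progress was reported since the previous call.
     * A notification racing with the reset may be folded into the current
     * window, which is acceptable for a coarse liveness signal.
     */
    boolean getAndClearProgress() {
        boolean seen = progressMade;
        if (seen) {
            progressMade = false;
        }
        return seen;
    }
}
```

A boolean carries only "something happened since the last heartbeat", which is all the heartbeat thread needs for a liveness check; a counter would only matter if the number of invocations had a use.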

I chose milliseconds because for LLAP scenarios in Hive, 500 ms stalls might be 
long enough to warrant similar actions.

I am not certain we can change the default to do this always because there 
are probably jobs out there in the wild where the processor sits and spends a 
lot of time doing work, e.g. expensive map joins. So without having those 
processors implement regular progress notifications, we may end up incorrectly 
failing tasks that are already expensive, and thereby become backwards 
incompatible. Right?
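
The opt-in behaviour and the millisecond granularity can be sketched as follows. This is an illustration under stated assumptions, not the patch's implementation: the class name, the DISABLED constant, and the convention that a non-positive timeout turns the check off are all hypothetical.

```java
// Hypothetical sketch: names and the "disabled unless configured" convention
// are assumptions, not taken from TEZ-808.1.patch.
final class StallDetector {
    /** Non-positive timeout: the check is off, so existing jobs are unaffected. */
    static final long DISABLED = -1L;

    private final long timeoutMillis;
    private long lastProgressMillis;

    StallDetector(long timeoutMillis, long startMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = startMillis;
    }

    /**
     * Called from the heartbeat thread only.
     * progressSeen is whether the task reported progress since the last heartbeat;
     * nowMillis is the current time in milliseconds.
     */
    boolean isStalled(boolean progressSeen, long nowMillis) {
        if (progressSeen) {
            lastProgressMillis = nowMillis;
        }
        if (timeoutMillis <= 0) {
            return false; // detection not enabled for this job
        }
        return nowMillis - lastProgressMillis > timeoutMillis;
    }
}
```

With the check disabled by default, a processor that spends minutes inside an expensive map join without reporting progress is never failed; a latency-sensitive deployment could opt in with a timeout as small as 500 ms.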

> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.


