[jira] [Commented] (TEZ-808) Handle task attempts that are not making progress

Jason Lowe (JIRA) Tue, 27 Oct 2015 13:04:28 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977056#comment-14977056
 ]


Jason Lowe commented on TEZ-808:
--------------------------------

bq. If we use a boolean, then I think it will be fine to not use volatile since 
we are not looking at fine grained inter-thread collisions. The heartbeat 
thread reads this quite infrequently and a true value should be visible to it 
when it executes. Thoughts?

The concern is that if we don't mark it volatile then theoretically the JVM 
could optimize the code in such a way that it decides it never needs to push 
the boolean update out to memory during the main processing loop (i.e., it 
ends up keeping it in a register or some other type of thread-local storage).  
If that were to occur then a separate observing thread would never see the 
boolean update.  In practice I don't think JVMs would end up doing that much 
optimization, but I believe it would be theoretically possible.  I'm not a JVM 
optimization/memory model expert though.  If we know the raw boolean will 
always work in practice then I'm OK with it.  The other thread doesn't need to 
see the update immediately, just within a reasonable timeframe afterwards.
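
To make the visibility concern concrete, here is a minimal sketch of a 
volatile progress flag shared between a task thread and a heartbeat thread.  
The class and method names are hypothetical and are not taken from the 
TEZ-808 patch; this only illustrates the guarantee volatile provides, it is 
not the actual implementation.

```java
// Hypothetical sketch; names are illustrative and not from the TEZ-808 patch.
final class ProgressFlagSketch {
    // volatile guarantees that a write made by the task thread is visible to
    // a later read on the heartbeat thread. With a plain boolean the Java
    // memory model makes no such promise.
    private volatile boolean progressNotified = false;

    // Called by the task thread from its main processing loop.
    void notifyProgress() {
        progressNotified = true;
    }

    // Called by the heartbeat thread. Returns true if progress was reported
    // since the previous call, and resets the flag. The read and the reset
    // are not one atomic step, so a notification that lands between them is
    // folded into the current interval, which is acceptable for a coarse
    // timeout check.
    boolean getAndClearProgress() {
        boolean seen = progressNotified;
        if (seen) {
            progressNotified = false;
        }
        return seen;
    }
}
```

If an atomic read-and-reset were wanted, java.util.concurrent.atomic.AtomicBoolean 
with getAndSet(false) would be one alternative to the volatile field.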

bq. I chose milliseconds because for LLAP scenarios in hive 500ms stalls might 
be long enough to warrant similar actions.

Ah, sorry, I didn't think anyone would try to make sub-second timeouts work 
given all the levels of retries, etc. in the RPC layer itself that can easily 
blow past the one-second mark.  Given that it only checks the timestamp when a 
heartbeat eventually does arrive, I assume LLAP scenarios also drastically 
reduce the container heartbeat expiration interval?  Speaking of which, do we 
need to automatically lower the expire interval on the task heartbeat monitor 
if the progress timeout is lower than the heartbeat timeout?
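
If the answer to that question turns out to be yes, the adjustment could look 
roughly like the sketch below.  Everything here is an assumption for 
illustration: the class, the method names, and the convention that a 
non-positive progress timeout means the check is disabled do not come from 
the patch.

```java
// Hypothetical sketch; names and the "non-positive means disabled"
// convention are assumptions, not taken from the TEZ-808 patch.
final class ProgressTimeoutSketch {
    // Interval the heartbeat monitor would use to expire a task: the smaller
    // of the two timeouts, so a short progress timeout is not masked by a
    // longer heartbeat timeout.
    static long effectiveExpireIntervalMs(long heartbeatTimeoutMs,
                                          long progressTimeoutMs) {
        if (progressTimeoutMs <= 0) {
            return heartbeatTimeoutMs; // progress check disabled
        }
        return Math.min(heartbeatTimeoutMs, progressTimeoutMs);
    }

    // Check performed when a heartbeat arrives: has the task gone longer
    // than the progress timeout without reporting progress?
    static boolean isStalled(long nowMs, long lastProgressMs,
                             long progressTimeoutMs) {
        return progressTimeoutMs > 0
            && nowMs - lastProgressMs > progressTimeoutMs;
    }
}
```

With a 5000 ms heartbeat timeout and a 500 ms progress timeout, 
effectiveExpireIntervalMs would return 500, so the monitor wakes often enough 
to act on the shorter timeout.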

bq. So without having those processors implement regular progress 
notifications, we may end up incorrectly failing tasks that are already 
expensive and become backwards incompatible. Right?

Yes, it is theoretically backwards incompatible to enable the timeout when it 
wasn't enabled before.  I guess we'll just have to make sure we call out the 
new property once it should work out-of-the-box (i.e., when the IO framework 
progress hookups are complete).  That way users will hopefully discover that 
they don't want to run most production cluster setups without this timeout 
enabled before they are bitten by the default settings when a task hangs.


> Handle task attempts that are not making progress
> -------------------------------------------------
>
>                 Key: TEZ-808
>                 URL: https://issues.apache.org/jira/browse/TEZ-808
>             Project: Apache Tez
>          Issue Type: Sub-task
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang. 
> We may want to kill and restart the attempt. With speculation support and 
> free resources we may want to run another version in parallel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
