[
https://issues.apache.org/jira/browse/TEZ-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14977056#comment-14977056
]
Jason Lowe commented on TEZ-808:
--------------------------------
bq. If we use a boolean, then I think it will be fine to not use volatile since
we are not looking at fine grained inter-thread collisions. The heartbeat
thread reads this quite infrequently and a true value should be visible to it
when it executes. Thoughts?
The concern is that if we don't mark it volatile then theoretically the JVM
could optimize the code in such a way that it mistakenly thinks it never needs
to push the boolean update out to memory during the main processing loop (i.e.:
it ends up keeping it in a register or some other type of thread-local
storage). If that were to ever occur then a separate observing thread would
never see the boolean update. In practice I don't think JVMs would end up
doing that much optimization, but I believe it would be theoretically possible.
I'm not a JVM optimization/memory model expert though. If we know the raw
boolean will always work in practice then I'm OK with it. The other thread
doesn't need to see the update immediately, just within a reasonable
timeframe afterwards.
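To make the visibility concern concrete, here is a minimal sketch (not the
TEZ-808 patch; the class, field, and method names are hypothetical) of a flag
written by one thread and polled by another. With volatile, the Java memory
model guarantees that the polling thread eventually observes the write. Without
it, a JIT compiler is allowed to hoist the read out of the loop, in which case
the observer could spin forever.

```java
public class ProgressFlagDemo {
    // Hypothetical flag: written by the processing thread, read by an observer
    // (e.g. a heartbeat thread). volatile establishes a happens-before edge
    // between the write and any later read, so the update cannot stay hidden.
    private volatile boolean progressNotified = false;

    public void notifyProgress() {
        progressNotified = true;
    }

    public boolean hasProgress() {
        return progressNotified;
    }

    // Starts an observer that polls the flag, sets the flag from the calling
    // thread, and reports whether the observer finished within timeoutMillis.
    public static boolean observerSeesUpdate(long timeoutMillis) {
        ProgressFlagDemo task = new ProgressFlagDemo();
        Thread observer = new Thread(() -> {
            while (!task.hasProgress()) {
                // Busy-wait until the write becomes visible.
            }
        });
        observer.setDaemon(true); // do not keep the JVM alive if it never exits
        observer.start();
        try {
            Thread.sleep(50);     // let the observer enter its loop first
            task.notifyProgress();
            observer.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !observer.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("observer saw update: " + observerSeesUpdate(2000));
    }
}
```

Since the heartbeat thread only needs to see the update within a reasonable
timeframe, the cost of a volatile read on an infrequent path should be
negligible compared to the risk of a missed update.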
bq. I chose milliseconds because for LLAP scenarios in hive 500ms stalls might
be long enough to warrant similar actions.
Ah, sorry, I didn't think anyone would try to make sub-second timeouts work
given all the levels of retries, etc. in the RPC layer itself that can easily
blow past the seconds barrier. Given that it only checks the timestamp when a heartbeat
eventually does arrive, I assume LLAP scenarios also drastically reduce the
container heartbeat expiration interval? Speaking of which, do we need to
automatically lower the expire interval on the task heartbeat monitor if the
progress timeout is lower than the heartbeat timeout?
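As a rough sketch of what I mean (hypothetical names and semantics, not
existing Tez code): the monitor's expire interval could be capped at the
progress timeout, and the progress check itself is just a timestamp comparison
made when a heartbeat arrives. Treating a non-positive progress timeout as
"disabled" is an assumption for illustration only.

```java
public class ProgressTimeoutSketch {
    // Hypothetical helper: never let the heartbeat monitor's expire interval
    // exceed the progress timeout, otherwise a stalled task could go unnoticed
    // for longer than the configured progress timeout.
    public static long effectiveExpireIntervalMillis(long heartbeatTimeoutMillis,
                                                     long progressTimeoutMillis) {
        if (progressTimeoutMillis <= 0) {
            // Assumption: non-positive means the progress timeout is disabled.
            return heartbeatTimeoutMillis;
        }
        return Math.min(heartbeatTimeoutMillis, progressTimeoutMillis);
    }

    // Hypothetical check run when a heartbeat arrives: has too much time
    // passed since the task last reported progress?
    public static boolean progressTimedOut(long lastProgressMillis,
                                           long nowMillis,
                                           long progressTimeoutMillis) {
        return progressTimeoutMillis > 0
            && nowMillis - lastProgressMillis > progressTimeoutMillis;
    }
}
```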
bq. So without having those processors implement regular progress
notifications, we may end up incorrectly failing tasks that are already
expensive and become backwards incompatible. Right?
Yes, it is theoretically backwards incompatible to enable the timeout when it
wasn't enabled before. I guess we'll just have to make sure we call out the
new property once it should work out-of-the-box (i.e.: when IO framework
progress hookups are complete). That way users will hopefully discover that
they don't want to run most production cluster setups without this timeout
enabled before they are bitten by the default settings when a task hangs.
> Handle task attempts that are not making progress
> -------------------------------------------------
>
> Key: TEZ-808
> URL: https://issues.apache.org/jira/browse/TEZ-808
> Project: Apache Tez
> Issue Type: Sub-task
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-808.1.patch
>
>
> If a task attempt is not making progress then it may cause the job to hang.
> We may want to kill and restart the attempt. With speculation support and
> free resources we may want to run another version in parallel.