[
https://issues.apache.org/jira/browse/TEZ-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14739676#comment-14739676
]
Bikas Saha commented on TEZ-2808:
---------------------------------
Both the assignment logic and the preemption logic take the scheduler lock for
exclusive access. However, preemption checks for delayed containers inside the
lock, while assignment polls (and removes) the delayed containers outside the
lock, which leads to the race condition. The simplest and safest fix is to poll
the delayed containers (the shared variable in this case) inside the lock. It
is hard to write a test for this. The race was found (intermittently) when
numNMs in TestAnalyzer was reduced to 1. I verified that with the fix it no
longer happens across many runs. [~hitesh] Please review.
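
A minimal sketch of the idea behind the fix, not the actual patch. The class
LockedPollScheduler, its fields, and its methods are hypothetical names chosen
for illustration and do not come from the Tez code base. It assumes a single
scheduler lock, a queue of delayed containers, and a counter of pending
requests. Polling and matching happen in one critical section, so the
preemption check cannot observe a container that has left the queue but has
not yet been assigned.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of the scheduler; none of these names come from Tez.
class LockedPollScheduler {
    private final Object schedulerLock = new Object();
    private final Queue<String> delayedContainers = new ArrayDeque<>();
    private int pendingRequests;

    LockedPollScheduler(int pendingRequests) {
        this.pendingRequests = pendingRequests;
    }

    // Called when a new container is handed to the scheduler.
    void onContainerAllocated(String containerId) {
        synchronized (schedulerLock) {
            delayedContainers.add(containerId);
        }
    }

    // Assignment: the poll and the match share one critical section.
    boolean assignNext() {
        synchronized (schedulerLock) {
            if (pendingRequests == 0) {
                return false; // nothing to match, leave the queue untouched
            }
            String container = delayedContainers.poll();
            if (container == null) {
                return false; // no container available
            }
            pendingRequests--; // the container now serves one request
            return true;
        }
    }

    // Preemption check: preempt only if work is pending and no container
    // is waiting in the delayed queue.
    boolean shouldPreempt() {
        synchronized (schedulerLock) {
            return pendingRequests > 0 && delayedContainers.isEmpty();
        }
    }
}
```

With one pending request and one allocated container, shouldPreempt() returns
false both before and after assignNext() runs, because there is no window in
which the container is in neither the queue nor an assignment.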
> Race condition between preemption and container assignment
> ----------------------------------------------------------
>
> Key: TEZ-2808
> URL: https://issues.apache.org/jira/browse/TEZ-2808
> Project: Apache Tez
> Issue Type: Bug
> Reporter: Bikas Saha
> Assignee: Bikas Saha
> Attachments: TEZ-2808.1.patch
>
>
> A new container is allocated and put in the delayed container manager queue.
> This triggers an assignment run on the delayed container manager thread.
> On the AMRMClient callback thread, preemption is called. This guarantees that
> the preemption logic is invoked at regular intervals even when nothing else
> is happening because there are no containers allocated or to match. The
> preemption logic checks whether containers are available to assign by
> looking at the delayed container manager queue. If, by this time, the
> assignment thread has already polled the queue and removed the container to
> check it for assignment, the preemption code sees no containers available to
> assign, so it proceeds to preempt containers.
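
The interleaving described above can be replayed deterministically on a single
thread. This is an illustrative sketch only: UnlockedPollReplay and all of its
members are hypothetical names, not Tez classes, and the two "threads" are
simulated by calling the steps in the problematic order.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical model of the pre-fix behavior; names are not from Tez.
class UnlockedPollReplay {
    private final Object schedulerLock = new Object();
    private final Queue<String> delayedContainers = new ConcurrentLinkedQueue<>();
    private int pendingRequests = 1;

    void onContainerAllocated(String containerId) {
        delayedContainers.add(containerId);
    }

    // Step 1 of assignment: the container is removed WITHOUT the scheduler lock.
    String pollOutsideLock() {
        return delayedContainers.poll();
    }

    // Step 2 of assignment: the match itself happens under the lock.
    void assignUnderLock(String container) {
        synchronized (schedulerLock) {
            if (container != null && pendingRequests > 0) {
                pendingRequests--;
            }
        }
    }

    // Preemption check, done under the lock as in the description.
    boolean shouldPreempt() {
        synchronized (schedulerLock) {
            return pendingRequests > 0 && delayedContainers.isEmpty();
        }
    }

    // Replays the race: returns true if preemption fired spuriously.
    static boolean replay() {
        UnlockedPollReplay s = new UnlockedPollReplay();
        s.onContainerAllocated("container_1");
        String inFlight = s.pollOutsideLock(); // assignment thread polls
        boolean spurious = s.shouldPreempt();  // callback thread checks
        s.assignUnderLock(inFlight);           // assignment thread resumes
        return spurious;
    }
}
```

Between the poll and the locked assignment the container is in flight: it is
no longer in the queue and not yet matched, so the preemption check concludes
that nothing is available even though a container is about to be assigned.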
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)