Tim Armstrong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15672 )

Change subject: IMPALA-9611: fix hang when cancelling join builder
......................................................................

IMPALA-9611: fix hang when cancelling join builder

The error could occur in the following scenario, where thread A is
executing a join build fragment and thread B is cancelling the fragment
instance.
1. Thread A is in HandoffToProbesAndWait(), reads is_cancelled_ and sees
   false.
2. Thread B in RuntimeState::Cancel() sets is_cancelled_ = true, acquires
   cancellation_cvs_lock_, then calls NotifyAll() on the condition
   variable.
3. Thread A calls Wait() on the condition variable, blocks forever
   because cancellation already happened.

The fix is for thread B to acquire the lock that thread A is holding.
That prevents the race because #1 and #3 above are in the same critical
section and thread B won't be able to signal the condition variable
until thread A has released it.

Testing:
Added metric check to test_failpoints to make it easier to detect hangs
caused by those tests in future.

Looped test_failpoints.py overnight, which was previously enough to
reproduce the failure within a couple of hours.

Ran exhaustive tests.

Change-Id: I996ad2055d6542eb57e12c663b89de5f84208f77
---
M be/src/exec/join-builder.cc
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M tests/failure/test_failpoints.py
4 files changed, 35 insertions(+), 11 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/72/15672/1
--
To view, visit http://gerrit.cloudera.org:8080/15672
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I996ad2055d6542eb57e12c663b89de5f84208f77
Gerrit-Change-Number: 15672
Gerrit-PatchSet: 1
Gerrit-Owner: Tim Armstrong <[email protected]>
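
For readers unfamiliar with this kind of lost-wakeup race, the locking pattern described in the commit message can be sketched roughly as follows. This is only an illustration using the C++ standard library, not the code in this change: the Builder struct, its members, and the functions WaitForProbesOrCancel(), Cancel() and CancelWakesWaiter() are hypothetical stand-ins, and Impala's own lock and condition variable classes are not used here.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the join builder's wait state (not Impala's classes).
struct Builder {
  std::mutex lock;                       // lock thread A holds while checking and waiting
  std::condition_variable cv;            // signalled on probe completion or cancellation
  std::atomic<bool> is_cancelled{false};
  bool probes_done = false;              // protected by 'lock'; never set in this sketch

  // Thread A: the cancellation check (step 1) and the wait (step 3) happen
  // inside one critical section. Returns true if cancellation was observed.
  bool WaitForProbesOrCancel() {
    std::unique_lock<std::mutex> l(lock);
    while (!probes_done && !is_cancelled.load()) cv.wait(l);
    return is_cancelled.load();
  }

  // Thread B: set the flag, then take the waiter's lock before notifying.
  // If A already saw is_cancelled == false, B blocks here until A is inside
  // wait() (which releases the lock), so the notification cannot be missed.
  void Cancel() {
    is_cancelled.store(true);
    std::lock_guard<std::mutex> l(lock);
    cv.notify_all();
  }
};

// Runs one waiter and one canceller concurrently; terminates for every
// interleaving because the notify is serialized with the check-then-wait.
bool CancelWakesWaiter() {
  Builder b;
  bool woke_on_cancel = false;
  std::thread a([&] { woke_on_cancel = b.WaitForProbesOrCancel(); });
  std::thread c([&] { b.Cancel(); });
  a.join();
  c.join();
  return woke_on_cancel;
}
```

If Cancel() notified without holding the same lock as the waiter, the notification could land between the waiter's check and its wait, which is the hang described in steps 1 to 3.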
