Thomas Tauber-Marshall has uploaded this change for review. ( 
http://gerrit.cloudera.org:8080/9718


Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances
......................................................................

IMPALA-6338: Fix flaky test_profile_fragment_instances

test_profile_fragment_instances checks that, once all the results have
been returned, every fragment instance appears in the query profile
for a query that internally cancels fragment instances that are still
executing when the results have been fully returned.

Every fis is guaranteed to send a profile to the coordinator in
Finalize(), but previously fragment profiles were not applied by the
coordinator if the backend was 'done', defined as either all instances
having completed or one has entered an error state (including
cancelled).

So, the test could fail by the following sequence:
- Some fragment for a particular backend sends an update to the
  coordinator. 'returned_all_results_' is true, so the coordinator
  responds indicating the the backend should cancel its remaining
  fragments.
- Another fragment from that backend executes Finalize() and reports
  that it was cancelled. This causes the coordinator to consider the
  entire backend to be 'done'.
- A third fragment, which had not previously sent a report from the
  reporting thread, from the same backend executes Finalize(). This
  report will not be applied by the coordinator as the backend is
  considered 'done', so this fragment will not appear in the final
  profile.

The solution is to change the definition of 'done' to not include a
backend that has been cancelled but still has fragments that haven't
completed. This guarantees that for queries that complete successfully
and are cancelled internally, all fis will send a report and have it
applied by the coordinator before all results have been returned,
since if eos is true Coordinator::GetNext() calls
WaitForBackendCompletion(), which in this situation will now wait for
all fis to Finalize().

Returning results for queries that are cancelled by the user is
unaffected as the manual cancel path causes WaitForBackendCompletion().

Testing:
- Ran test_profile_fragment_instances in a loop with no failures.
  I can reliably repro the original problem with a few carefully
  placed sleeps.

Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
---
M be/src/common/status.h
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/coordinator-backend-state.h
M be/src/runtime/coordinator.cc
M be/src/runtime/coordinator.h
M tests/query_test/test_observability.py
6 files changed, 37 insertions(+), 20 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/18/9718/1
--
To view, visit http://gerrit.cloudera.org:8080/9718
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2f8295e4ec580b476f7668c77aec1348fb2e868c
Gerrit-Change-Number: 9718
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Tauber-Marshall <tmarsh...@cloudera.com>

Reply via email to