Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8997 )
Change subject: IMPALA-6338: Fix flaky test_profile_fragment_instances ...................................................................... IMPALA-6338: Fix flaky test_profile_fragment_instances test_profile_fragment_instances checks that, once all the results have been returned, every fragment instance appears in the query profile for a query that internally cancels fragment instances that are still executing when the results have been fully returned. Every fis is guaranteed to send a profile to the coordinator in Finalize(), but previously fragment profiles were not applied by the coordinator if the backend was 'done', defined as either all instances having completed or one has entered an error state (including cancelled). So, the test could fail by the following sequence: - Some fragment for a particular backend sends an update to the coordinator. 'returned_all_results_' is true, so the coordinator responds indicating the the backend should cancel its remaining fragments. - Another fragment from that backend executes Finalize() and reports that it was cancelled. This causes the coordinator to consider the entire backend to be 'done'. - A third fragment, which had not previously sent a report from the reporting thread, from the same backend executes Finalize(). This report will not be applied by the coordinator as the backend is considered 'done', so this fragment will not appear in the final profile. The solution is to change the definition of 'done' to not include a backend that has been cancelled but still has fragments that haven't completed. This guarantees that for queries that complete successfully and are cancelled internally, all fis will send a report and have it applied by the coordinator before all results have been returned, since if eos is true Coordinator::GetNext() calls WaitForBackendCompletion(), which in this situation will now wait for all fis to Finalize(). Returning results for queries that are cancelled by the user is unaffected as the manual cancel path causes WaitForBackendCompletion(). Testing: - Ran test_profile_fragment_instances in a loop with no failures. I can reliably repro the original problem with a few carefully placed sleeps. Change-Id: I77773a1e3c62952003f37f88fe2b662bb11889ed Reviewed-on: http://gerrit.cloudera.org:8080/8997 Reviewed-by: Tim Armstrong <[email protected]> Tested-by: Impala Public Jenkins --- M be/src/common/status.h M be/src/runtime/coordinator-backend-state.cc M be/src/runtime/coordinator-backend-state.h M be/src/runtime/coordinator.cc M be/src/runtime/coordinator.h M tests/query_test/test_observability.py 6 files changed, 30 insertions(+), 18 deletions(-) Approvals: Tim Armstrong: Looks good to me, approved Impala Public Jenkins: Verified -- To view, visit http://gerrit.cloudera.org:8080/8997 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I77773a1e3c62952003f37f88fe2b662bb11889ed Gerrit-Change-Number: 8997 Gerrit-PatchSet: 7 Gerrit-Owner: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Dan Hecht <[email protected]> Gerrit-Reviewer: Impala Public Jenkins Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Sailesh Mukil <[email protected]> Gerrit-Reviewer: Thomas Tauber-Marshall <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]>
