Thomas Marshall has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happened, if the rpc failed, QueryState would retry twice
after a configurable timeout and then cancel the fragment instances
under the assumption that the coordinator no longer exists.

We've found in real clusters that this logic is too sensitive to failed
rpcs and can result in fragment instances being cancelled even in cases
where the coordinator is still running.

This patch makes a few improvements to this logic:
- When an intermediate report is generated, if the first attempt to
  send it to the coordinator fails, the report is discarded. Only the
  final report, sent when all of the fragment instances have completed,
  is retried.
- Backoff is used, both for the time between generating intermediate
  reports (controlled by FLAG_status_report_interval_ms) and for the
  time between retries for the final report (controlled by
  FLAG_report_status_retry_interval_ms), such that for a period between
  retries of 't', on try 'n' the actual timeout will be t * n.

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

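For readers skimming the description, the reporting policy above can be sketched roughly as follows. This is an illustrative Python sketch only, not the C++ code in query-state.cc: the names backoff_delay_ms, send_report, send_rpc and max_attempts are hypothetical, and the attempt limit is an assumption, since the description does not say how many times the final report is retried. Taking the stated t * n rule literally, the wait grows linearly with the try number.

```python
import time
from typing import Callable


def backoff_delay_ms(base_interval_ms: int, attempt: int) -> int:
    """Return the wait for try number `attempt` (1-based): t * n."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return base_interval_ms * attempt


def send_report(
    send_rpc: Callable[[], bool],
    is_final: bool,
    retry_interval_ms: int,
    max_attempts: int = 3,  # hypothetical limit, not stated in the description
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one status report; return True if it was delivered."""
    # Intermediate reports get a single attempt and are dropped on failure.
    attempts_allowed = max_attempts if is_final else 1
    for attempt in range(1, attempts_allowed + 1):
        if send_rpc():
            return True
        if attempt < attempts_allowed:
            # Only the final report reaches this point: wait t * n, then retry.
            sleep(backoff_delay_ms(retry_interval_ms, attempt) / 1000.0)
    return False
```

With a base interval of 100 ms, the waits after the first and second failed tries of a final report would be 100 ms and 200 ms; a failed intermediate report returns False immediately without sleeping.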
Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 61 insertions(+), 15 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/1
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>