Impala Public Jenkins has submitted this change and it was merged.
( http://gerrit.cloudera.org:8080/12049 )

Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, if
the RPC carrying a report failed, QueryState retried twice after a
configurable timeout and then cancelled the fragment instances on the
assumption that the coordinator no longer existed.

We've found in real clusters that this logic is too sensitive to
failed RPCs and can result in fragment instances being cancelled even
when the coordinator is still running.

This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
  quickly (after waiting report_status_retry_interval_ms), we wait the
  regular reporting interval (status_report_interval_ms), regenerate
  any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
  controls the number of failed reports that are allowed before the
  query is cancelled. --report_status_retry_interval_ms is removed.
- Linear backoff is applied to repeated failed attempts: for a base
  period 't' between retries, the actual timeout on try 'n' is t * n.
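
The retry policy described above can be sketched roughly as follows.
This is an illustration only, not the code in query-state.cc: the
function names, the 5000 ms interval in the usage note, and the exact
point at which the retry limit triggers cancellation (here, once the
number of failures exceeds max_retries) are all assumptions. The
sketch records the waits instead of sleeping so it can be run directly.

```python
from typing import Callable, List, Tuple


def backoff_wait_ms(base_interval_ms: int, attempt: int) -> int:
    """Linear backoff: for base period t, try n waits t * n (n >= 1)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return base_interval_ms * attempt


def report_until_cancelled(
    send_report: Callable[[], bool],
    base_interval_ms: int,
    max_retries: int,
) -> Tuple[bool, List[int]]:
    """Hypothetical reporting loop for one status report.

    send_report returns True when the report reaches the coordinator.
    Returns (succeeded, waits_ms); succeeded is False when the caller
    should cancel the query because too many reports failed.
    """
    waits_ms: List[int] = []
    failures = 0
    while True:
        # A real implementation would regenerate stale parts of the
        # report here before each send.
        if send_report():
            return True, waits_ms
        failures += 1
        # Assumption: max_retries failed reports are tolerated, and the
        # next failure gives up.
        if failures > max_retries:
            return False, waits_ms
        waits_ms.append(backoff_wait_ms(base_interval_ms, failures))
```

For example, with an assumed base interval of 5000 ms and a report
that fails twice before succeeding, the recorded waits are 5000 ms and
then 10000 ms.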

Testing:
- Added a test in which a large number of intermediate status reports
  fail but the query still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Reviewed-on: http://gerrit.cloudera.org:8080/12049
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 65 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 9
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>
