Hello Michael Ho, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/12049
to look at the new patch set (#4).
Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................
IMPALA-4555: Make QueryState's status reporting more robust
QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happened, if the rpc failed, QueryState would retry twice
after a configurable timeout and then cancel the fragment instances
under the assumption that the coordinator no longer existed.
We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.
This patch makes a few improvements to this logic:
- When a report fails to send, instead of retrying the same report
quickly (after waiting report_status_retry_interval_ms), we wait the
regular reporting interval (status_report_interval_ms), regenerate
any stale portions of the report, and then retry.
- A new flag, --status_report_max_retries, is introduced, which
controls the number of failed reports that are allowed before the
query is cancelled. --report_status_retry_interval_ms is removed.
- Linear backoff is used for repeated failures, such that for a period
between retries of 't', on try 'n' the actual timeout will be t * n.
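
As a rough illustration of the retry policy described in the list above, the following Python sketch models the wait schedule only. It is not the patch's C++ code: the names backoff_delay_ms, send_with_retries and send_report are hypothetical, the interval values are arbitrary, and it assumes that the query is cancelled once the number of consecutive failures exceeds the retry limit and that the wait after the n-th failure is t * n.

```python
from typing import Callable, List, Tuple


def backoff_delay_ms(interval_ms: int, attempt: int) -> int:
    """Return the wait before retry number `attempt` (1-based): t * n."""
    if attempt < 1:
        raise ValueError("attempt is 1-based")
    return interval_ms * attempt


def send_with_retries(
    send_report: Callable[[], bool],
    interval_ms: int,
    max_retries: int,
) -> Tuple[bool, List[int]]:
    """Model the reporting loop without actually sleeping.

    send_report is a hypothetical callable that regenerates the report
    and returns True if the rpc succeeded. Returns (succeeded, waits),
    where waits lists the delay in ms that would precede each retry.
    """
    waits: List[int] = []
    failures = 0
    while True:
        if send_report():
            return True, waits
        failures += 1
        # Assumption: exceeding the limit is what triggers cancellation.
        if failures > max_retries:
            return False, waits
        waits.append(backoff_delay_ms(interval_ms, failures))


if __name__ == "__main__":
    outcomes = iter([False, False, True])
    print(send_with_retries(lambda: next(outcomes), 5000, 3))
```

Under these assumptions, two failed reports followed by a success would be preceded by waits of t and 2t, and the query is only cancelled after the configured number of failures has been exceeded.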
Testing:
- Added a test in which a large number of intermediate status reports
fail but the query still succeeds.
Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/common/global-flags.cc
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M common/protobuf/control_service.proto
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 192 insertions(+), 64 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/4
--
To view, visit http://gerrit.cloudera.org:8080/12049
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 4
Gerrit-Owner: Thomas Marshall <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Thomas Marshall <[email protected]>