Thomas Marshall has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12049 )


Change subject: IMPALA-4555: Make QueryState's status reporting more robust
......................................................................

IMPALA-4555: Make QueryState's status reporting more robust

QueryState periodically collects runtime profiles from all of its
fragment instances and sends them to the coordinator. Previously, each
time this happened, if the rpc failed, QueryState would retry twice
after a configurable timeout and then cancel the fragment instances
under the assumption that the coordinator no longer existed.

We've found in real clusters that this logic is too sensitive to
failed rpcs and can result in fragment instances being cancelled even
in cases where the coordinator is still running.

This patch makes a few improvements to this logic:
- When an intermediate report is generated, if the first attempt to
  send it to the coordinator fails, the report is discarded. Only the
  final report, sent when all of the fragment instances have
  completed, is retried.
- Backoff is used, both for the time between generating
  intermediate reports (controlled by FLAG_status_report_interval_ms)
  and for the time between retries for the final report (controlled by
  FLAG_report_status_retry_interval_ms), such that for a base period
  of 't', on try 'n' the actual timeout will be t * n (that is, the
  wait grows linearly with the number of tries).
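
The reporting policy described in the two points above can be sketched
roughly as follows. This is an illustrative Python model only, not the
C++ code in query-state.cc. The names backoff_delay_ms, send_report,
send_rpc, max_attempts and sleep_ms are hypothetical, and the sketch
assumes that the wait after the n-th failed try is t * n.

```python
from typing import Callable


def backoff_delay_ms(base_interval_ms: int, attempt: int) -> int:
    """Return t * n for base period t and 1-based try number n."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return base_interval_ms * attempt


def send_report(send_rpc: Callable[[], bool],
                is_final: bool,
                retry_interval_ms: int,
                max_attempts: int,
                sleep_ms: Callable[[int], None]) -> bool:
    """Try to deliver one status report; return True on success.

    An intermediate report gets a single attempt and is discarded if
    that attempt fails. Only the final report is retried, waiting
    t * n after the n-th failed try. max_attempts is an assumed
    parameter; the commit message does not state a retry limit.
    """
    attempts = max_attempts if is_final else 1
    for attempt in range(1, attempts + 1):
        if send_rpc():
            return True
        if attempt < attempts:
            # Assumption: back off before the next try of the final report.
            sleep_ms(backoff_delay_ms(retry_interval_ms, attempt))
    return False


# Usage: a final report whose first two rpcs fail, with the sleeps
# recorded in a list instead of actually waiting.
waits = []
outcomes = iter([False, False, True])
delivered = send_report(lambda: next(outcomes), True, 100, 3, waits.append)
```

With a base period of 100 ms the recorded waits are 100 ms and then
200 ms, while a failed intermediate report returns immediately without
sleeping, so a burst of failed rpcs no longer leads directly to
cancellation.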

Testing:
- Added a test which results in a large number of failed intermediate
  status reports but still succeeds.

Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
---
M be/src/runtime/coordinator-backend-state.cc
M be/src/runtime/fragment-instance-state.cc
M be/src/runtime/fragment-instance-state.h
M be/src/runtime/query-state.cc
M be/src/runtime/query-state.h
M be/src/runtime/runtime-state.cc
M be/src/runtime/runtime-state.h
M tests/custom_cluster/test_rpc_timeout.py
8 files changed, 61 insertions(+), 15 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/49/12049/1
--
To view, visit http://gerrit.cloudera.org:8080/12049
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ib6007013fc2c9e8eeba11b752ee58fb3038da971
Gerrit-Change-Number: 12049
Gerrit-PatchSet: 1
Gerrit-Owner: Thomas Marshall <thomasmarsh...@cmu.edu>
