Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3575: Add retry to backend connection request and rpc timeout
......................................................................


IMPALA-3575: Add retry to backend connection request and rpc timeout

This patch adds a configurable timeout to all backend client RPCs to
avoid query hangs.

Prior to this change, Impala did not set a socket send/recv timeout for
backend clients, so an RPC would wait forever for data. In extreme cases,
such as a bad network or a kernel panic on the destination host, the
sender never gets a response and the RPC hangs. Query hangs are hard to
detect. If the hang happens in ExecRemoteFragment() or
CancelPlanFragments(), the query cannot be cancelled unless you restart
the coordinator.

Added a send/recv timeout to all RPCs to avoid query hangs. For the
catalog client, the default timeout stays at 0 (no timeout) because
ExecDdl() can take a very long time if a table has many partitions,
mostly spent waiting for HMS API calls.
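
The effect of a send/recv timeout can be sketched with plain sockets.
This is an illustration only, not the patch's C++/Thrift code: the
names call_with_timeout and start_silent_server are hypothetical, and
the silent server stands in for a peer that has stopped responding.
Passing timeout_s=None corresponds to "no timeout", as kept for the
catalog client.

```python
import socket
import threading


def call_with_timeout(host, port, payload, timeout_s):
    """Send payload and wait for a reply, giving up after timeout_s seconds.

    timeout_s=None means "no timeout": recv() blocks until data arrives.
    Returns the reply bytes, or None if the peer stayed silent too long.
    """
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.settimeout(timeout_s)  # applies to both send and recv
        sock.sendall(payload)
        try:
            return sock.recv(4096)
        except socket.timeout:
            return None  # the caller treats this as an RPC timeout


def start_silent_server():
    """Hypothetical peer that accepts one connection and never responds."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    conns = []

    def accept_once():
        conn, _ = srv.accept()
        conns.append(conn)  # keep the connection open, never reply

    threading.Thread(target=accept_once, daemon=True).start()
    return srv, conns
```

With a timeout set, the call against the silent server returns control
to the caller instead of blocking indefinitely.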

Added a wrapper, RetryRpcRecv(), to wait longer for the receiver's
response. Certain RPCs need this. For example, with TransmitData() from
DataStreamSender, the receiver can hold its response to apply back
pressure.
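
The idea of waiting longer on the receive side can be sketched as a
loop that repeats a timed-out recv until an overall deadline passes.
This is a sketch under assumptions, not the actual RetryRpcRecv()
implementation: the name retry_recv, its parameters, and the use of
socket.timeout as the per-attempt timeout signal are all hypothetical.

```python
import socket
import time


def retry_recv(recv_once, total_wait_s, clock=time.monotonic):
    """Call recv_once() until it returns or total_wait_s has elapsed.

    recv_once is expected to raise socket.timeout when a single attempt
    times out. Any other exception propagates immediately.
    """
    deadline = clock() + total_wait_s
    while True:
        try:
            return recv_once()
        except socket.timeout:
            if clock() >= deadline:
                raise  # the receiver stayed silent past the overall deadline
```

Only the receive step is repeated here; the request is not sent again,
which matters for calls whose receiver is deliberately slow to answer.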

If an RPC fails, the connection is left in an unrecoverable state, so
we close the underlying connection instead of putting it back in the
cache. This makes sure a broken connection won't cause more RPC failures.
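
The close-on-failure policy can be sketched with a small connection
cache. The ClientCache class below is hypothetical and much simpler
than the client cache in be/src/runtime/client-cache.h; it only shows
that a connection goes back to the idle pool after a successful RPC
and is closed after a failed one.

```python
class ClientCache:
    """Hypothetical cache of reusable connections keyed by address."""

    def __init__(self, connect):
        self._connect = connect  # factory: address -> new connection
        self._idle = {}          # address -> list of idle connections

    def acquire(self, address):
        idle = self._idle.get(address, [])
        return idle.pop() if idle else self._connect(address)

    def release(self, address, conn):
        self._idle.setdefault(address, []).append(conn)

    def do_rpc(self, address, rpc):
        conn = self.acquire(address)
        try:
            result = rpc(conn)
        except Exception:
            conn.close()  # state unknown: never hand it to the next caller
            raise
        self.release(address, conn)  # healthy connection: reuse it
        return result
```

The next caller after a failure gets a freshly created connection, at
the cost of one extra connection setup.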

Added retry for the CancelPlanFragment RPC. This reduces the chance that a
cancel request gets lost due to an unstable network, but it can make
cancellation take longer and make test_lifecycle.py more flaky.
The metric num-fragments-in-flight might not be 0 yet due to previous tests.
Modified the test to check the metric delta instead of comparing to 0 to
reduce flakiness. However, this might not capture some failures.
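
Checking a metric delta rather than an absolute value can be sketched
as follows. The helper wait_for_metric_delta and its parameters are
hypothetical and are not the code in tests/verifiers/metric_verifier.py;
read_metric stands in for whatever call fetches
num-fragments-in-flight.

```python
import time


def wait_for_metric_delta(read_metric, baseline, expected_delta,
                          timeout_s, poll_s=0.05):
    """Poll read_metric() until it equals baseline + expected_delta.

    baseline is the value sampled before the test started, so fragments
    left over from earlier tests do not count against this test.
    Returns True on success and False if timeout_s elapses first.
    """
    deadline = time.monotonic() + timeout_s
    target = baseline + expected_delta
    while True:
        if read_metric() == target:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The trade-off noted above shows up here too: if the baseline itself
includes a leaked fragment, a check against the delta will not notice.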

Besides the new EE test, I used the following iptables rules to
inject network failures and verify that RPCs never hang.
1. Block network traffic on a port completely
  iptables -A INPUT -p tcp -m tcp --dport 22002 -j DROP
2. Randomly drop 5% of TCP packets to slow down the network
  iptables -A INPUT -p tcp -m tcp --dport 22000 -m statistic --mode random --probability 0.05 -j DROP

Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Reviewed-on: http://gerrit.cloudera.org:8080/3343
Reviewed-by: Juan Yu <[email protected]>
Tested-by: Internal Jenkins
---
M be/src/common/global-flags.cc
M be/src/rpc/thrift-util.cc
M be/src/rpc/thrift-util.h
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/coordinator.cc
M be/src/runtime/data-stream-sender.cc
M be/src/runtime/exec-env.cc
M be/src/service/fragment-exec-state.cc
M be/src/service/impala-internal-service.h
M be/src/statestore/statestore.cc
A be/src/testutil/fault-injection-util.h
M be/src/util/error-util-test.cc
M common/thrift/generate_error_codes.py
A tests/custom_cluster/test_rpc_timeout.py
M tests/query_test/test_lifecycle.py
M tests/verifiers/metric_verifier.py
17 files changed, 397 insertions(+), 64 deletions(-)

Approvals:
  Juan Yu: Looks good to me, approved
  Internal Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/3343
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Id6723cfe58df6217f4a9cdd12facd320cbc24964
Gerrit-PatchSet: 24
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Juan Yu <[email protected]>
Gerrit-Reviewer: Alan Choi <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Huaisi Xu <[email protected]>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Juan Yu <[email protected]>
Gerrit-Reviewer: Sailesh Mukil <[email protected]>
