Hello Wenzhe Zhou, Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/21146
to look at the new patch set (#4).
Change subject: IMPALA-12699: Set timeout for catalog RPCs
......................................................................
IMPALA-12699: Set timeout for catalog RPCs
We have seen trivial GetPartialCatalogObject RPCs hang on the
coordinator side, e.g. IMPALA-11409. Due to the piggyback mechanism of
fetching metadata in local-catalog mode (see comments in
CatalogdMetaProvider#loadWithCaching()), a hanging RPC on shared
metadata (e.g. db/table list) could block other queries on the same
coordinator.
Such lightweight requests don't need to acquire table locks or trigger
table loading in catalogd. The hangs are usually caused by network
issues, e.g. a TCP connection becoming half-open after TCP
retransmissions time out. Retrying the RPC helps recover from such
failures. Currently, the timeout for catalog RPCs is set to 0 by
default, which prevents the retry and makes the client wait
indefinitely.
This patch distinguishes the lightweight catalog RPCs and uses a
dedicated catalogd client cache for them. They use a timeout of 30 mins,
which is long enough to tolerate TCP retransmission timeouts.
The patch also sets a timeout of 10 hours for other catalog RPCs.
Operations that take longer than that are usually abnormal and hanging.
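The timeout policy described above can be illustrated with a minimal Python sketch. It is not the code in this patch (the actual change is in the C++ client cache). The names pick_timeout_s and call_with_retry, the retry count, and the use of TimeoutError are assumptions made for illustration. Only the 30 minute and 10 hour values come from this message.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Values taken from the commit message; everything else here is hypothetical.
LIGHTWEIGHT_RPC_TIMEOUT_S = 30 * 60       # 30 minutes
OTHER_RPC_TIMEOUT_S = 10 * 60 * 60        # 10 hours


def pick_timeout_s(is_lightweight: bool) -> int:
    """Return the timeout, in seconds, for a catalog RPC of the given kind."""
    return LIGHTWEIGHT_RPC_TIMEOUT_S if is_lightweight else OTHER_RPC_TIMEOUT_S


def call_with_retry(rpc: Callable[[int], T], timeout_s: int,
                    max_attempts: int = 3) -> T:
    """Call rpc(timeout_s), retrying when it raises TimeoutError.

    With a timeout of 0 (wait forever) the TimeoutError would never be
    raised, so the retry path could never run.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return rpc(timeout_s)
        except TimeoutError as e:
            # Assumed stand-in for a timed-out call on a half-open connection.
            last_error = e
    raise last_error
```

A caller would pass pick_timeout_s(True) for a lightweight request such as a db/table list fetch, and pick_timeout_s(False) for everything else.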
Tests
- Add an e2e test to verify that the lightweight RPC client cache is
used.
- Adjust TestRestart.test_catalog_connection_retries to use local
catalog mode, since in the legacy catalog mode the coordinator only
sends PrioritizeLoad requests, which are lightweight RPCs.
This is a continuation of the patch by Wenzhe Zhou <[email protected]>
Change-Id: Iad39a79d0c89f2b04380f610a7e60558429e9c6e
---
M be/src/exec/catalog-op-executor.cc
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M common/thrift/metrics.json
M tests/custom_cluster/test_local_catalog.py
M tests/custom_cluster/test_restart_services.py
8 files changed, 83 insertions(+), 11 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/46/21146/4
--
To view, visit http://gerrit.cloudera.org:8080/21146
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iad39a79d0c89f2b04380f610a7e60558429e9c6e
Gerrit-Change-Number: 21146
Gerrit-PatchSet: 4
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>