Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/21146 )
Change subject: IMPALA-12699: Set timeout for catalog RPCs
......................................................................

IMPALA-12699: Set timeout for catalog RPCs

We have seen trivial GetPartialCatalogObject RPCs hanging on the
coordinator side, e.g. IMPALA-11409. Due to the piggyback mechanism of
fetching metadata in local-catalog mode (see comments in
CatalogdMetaProvider#loadWithCaching()), a hanging RPC on shared
metadata (e.g. db/table list) can block other queries on the same
coordinator. Such lightweight requests don't need to acquire the table
lock or trigger table loading in catalogd. The hangs are usually caused
by network issues, e.g. a TCP connection becoming half-open after TCP
retransmissions time out. Retrying the RPC helps to recover from such
failures. Currently, the timeout for catalog RPCs is set to 0 by
default. This prevents the retry and makes the client wait
indefinitely.

This patch distinguishes the lightweight catalog RPCs and uses a
dedicated catalogd client cache for them. They use a timeout of 30
minutes, which is long enough to tolerate TCP retransmission timeouts.
It also sets a timeout of 10 hours for the other catalog RPCs.
Operations that take longer than that are usually abnormal and hanging.

Tests
 - Add an e2e test to verify that the lightweight RPC client cache is
   used.
 - Adjust TestRestart.test_catalog_connection_retries to use local
   catalog mode, since in the legacy catalog mode the coordinator only
   sends PrioritizeLoad requests, which are lightweight RPCs.

This is a continuation of patch by Wenzhe Zhou <[email protected]>

Change-Id: Iad39a79d0c89f2b04380f610a7e60558429e9c6e
Reviewed-on: http://gerrit.cloudera.org:8080/21146
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/exec/catalog-op-executor.cc
M be/src/runtime/client-cache.cc
M be/src/runtime/client-cache.h
M be/src/runtime/exec-env.cc
M be/src/runtime/exec-env.h
M common/thrift/metrics.json
M tests/custom_cluster/test_local_catalog.py
M tests/custom_cluster/test_restart_services.py
8 files changed, 83 insertions(+), 11 deletions(-)

Approvals:
  Wenzhe Zhou: Looks good to me, approved
  Impala Public Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/21146
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Iad39a79d0c89f2b04380f610a7e60558429e9c6e
Gerrit-Change-Number: 21146
Gerrit-PatchSet: 7
Gerrit-Owner: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
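
For readers unfamiliar with the idea behind the change, the following is a minimal Python sketch of per-class RPC timeouts combined with retry on timeout. It is not the patch itself, which lives in the C++ backend (be/src/runtime/client-cache.cc and related files). The names pick_timeout_s, call_with_retry and max_attempts are hypothetical and chosen only for illustration; only the two timeout values (30 minutes and 10 hours) come from the commit message.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Timeout values taken from the commit message, expressed in seconds.
LIGHTWEIGHT_RPC_TIMEOUT_S = 30 * 60       # 30 minutes for lightweight catalog RPCs
DEFAULT_RPC_TIMEOUT_S = 10 * 60 * 60      # 10 hours for all other catalog RPCs


def pick_timeout_s(is_lightweight: bool) -> int:
    """Return the timeout for an RPC class (hypothetical helper)."""
    return LIGHTWEIGHT_RPC_TIMEOUT_S if is_lightweight else DEFAULT_RPC_TIMEOUT_S


def call_with_retry(rpc: Callable[[float], T], timeout_s: float,
                    max_attempts: int = 3) -> T:
    """Call rpc(timeout_s) and retry when it raises TimeoutError.

    max_attempts is an assumption made for this sketch; the commit
    message does not state how many retries Impala performs.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc(timeout_s)
        except TimeoutError:
            # Give up only after the final attempt has also timed out.
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

The sketch also shows why a timeout of 0 (wait forever) defeats the retry: the call never raises a timeout, so the retry branch is never reached and the caller stays blocked on the half-open connection.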
