[ https://issues.apache.org/jira/browse/IMPALA-11409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17830811#comment-17830811 ]
ASF subversion and git services commented on IMPALA-11409:
----------------------------------------------------------
Commit 9dcd136df13df632d90636abb97e9e168f1d8a89 in impala's branch
refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9dcd136df ]
IMPALA-12699: Set timeout for catalog RPCs
We have seen trivial GetPartialCatalogObject RPCs hanging on the
coordinator side, e.g. IMPALA-11409. Due to the piggyback mechanism of
fetching metadata in local-catalog mode (see comments in
CatalogdMetaProvider#loadWithCaching()), a hanging RPC on shared
metadata (e.g. db/table list) could block other queries on the same
coordinator.
Such lightweight requests don't need to acquire the table lock or
trigger table loading in catalogd. The hangs are usually caused by
network issues, e.g. a TCP connection becoming half-open after TCP
retransmissions time out. Retrying the RPC helps to recover from such
failures. Currently, the timeout for catalog RPCs is set to 0 by
default. This prevents the retry and makes the client wait
indefinitely.
This patch distinguishes the lightweight catalog RPCs and uses a
dedicated catalogd client cache for them. They use a timeout of 30
minutes, which is long enough to tolerate TCP retransmission timeouts.
The patch also sets a timeout of 10 hours for other catalog RPCs.
Operations that take longer than that are usually abnormal and hanging.
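The timeout-plus-retry behaviour described above can be sketched roughly
as follows. This is only an illustration under stated assumptions, not
the code in the patch: RpcStatus, Rpc, CallWithRetry and the two constant
names are hypothetical, and only the 30 minute and 10 hour values come
from this commit message.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

using Millis = std::chrono::milliseconds;

// Hypothetical status type for this sketch; not Impala's Status class.
enum class RpcStatus { kOk, kTimedOut };

// Hypothetical RPC signature: the callable receives the per-attempt
// timeout and reports kTimedOut if no reply arrived within it.
using Rpc = std::function<RpcStatus(Millis)>;

// Values from the commit message; the constant names are made up.
constexpr Millis kLightweightRpcTimeout = std::chrono::minutes(30);
constexpr Millis kOtherRpcTimeout = std::chrono::hours(10);

// Invokes the RPC up to max_attempts times and stops at the first
// success. With a finite timeout, a call stuck on a half-open connection
// returns kTimedOut and can be retried; without any timeout the first
// attempt would never return and the loop would never advance.
inline RpcStatus CallWithRetry(const Rpc& rpc, Millis timeout,
                               int max_attempts) {
  assert(max_attempts > 0);
  RpcStatus status = RpcStatus::kTimedOut;
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    status = rpc(timeout);
    if (status == RpcStatus::kOk) break;
    // A real client would typically reopen the connection here.
  }
  return status;
}
```

In this sketch a lightweight request would be issued as
CallWithRetry(rpc, kLightweightRpcTimeout, n), and any other catalog
request with kOtherRpcTimeout.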
Tests
- Add an e2e test to verify the lightweight RPC client cache is used.
- Adjust TestRestart.test_catalog_connection_retries to use local
catalog mode since in the legacy catalog mode, the coordinator only
sends PrioritizeLoad requests which are lightweight RPCs.
This is a continuation of the patch by Wenzhe Zhou <[email protected]>
Change-Id: Iad39a79d0c89f2b04380f610a7e60558429e9c6e
Reviewed-on: http://gerrit.cloudera.org:8080/21146
Reviewed-by: Wenzhe Zhou <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Skip UpdateCatalogMetrics if another thread is on-going in it
> -------------------------------------------------------------
>
> Key: IMPALA-11409
> URL: https://issues.apache.org/jira/browse/IMPALA-11409
> Project: IMPALA
> Issue Type: Improvement
> Components: Backend
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Critical
> Attachments: jstack-1.txt
>
>
> The Impala coordinator tracks local metrics of the catalog, e.g. the
> number of dbs/tables. When use_local_catalog is enabled, it also tracks
> the cache metrics, e.g. cache hit/miss count/rate.
> These metrics are updated at the end of each statement, even for simple
> statements like "USE <db>", "SET var=xxx", "SELECT 1". The catalog update
> thread also updates the metrics at the end.
> [https://github.com/apache/impala/blob/bb610dee09a8069bb993b4c668f7e481c1774b70/be/src/service/impala-server.cc#L1272]
> [https://github.com/apache/impala/blob/bb610dee09a8069bb993b4c668f7e481c1774b70/be/src/service/impala-server.cc#L2065]
> These metrics are global metrics of the local catalog cache. They are not
> specific to a single statement. It's a waste to update the metrics
> concurrently.
> [https://github.com/apache/impala/blob/bb610dee09a8069bb993b4c668f7e481c1774b70/be/src/service/impala-server.cc#L1526-L1559]
> We've seen "hanging issues" in which all statements, as well as the catalog
> update thread, are slowly executing the UpdateCatalogMetrics() function. See
> details in the attached jstack dump.
> If one thread is already running the UpdateCatalogMetrics() function, the
> other threads can skip it and move forward.
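The skip proposed in the description above could be sketched as follows.
This is a minimal illustration, not the actual fix: the class
CatalogMetricsUpdater, its method UpdateIfIdle() and the refresh callback
are hypothetical stand-ins for the real UpdateCatalogMetrics() code path.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical wrapper: at most one thread runs the refresh at a time,
// and any thread that finds a refresh in progress skips it.
class CatalogMetricsUpdater {
 public:
  explicit CatalogMetricsUpdater(std::function<void()> refresh)
      : refresh_(std::move(refresh)) {
    assert(refresh_ != nullptr);
  }

  // Returns true if this call ran the refresh, false if it was skipped
  // because another caller is already inside the refresh.
  bool UpdateIfIdle() {
    bool expected = false;
    if (!updating_.compare_exchange_strong(expected, true)) return false;
    // Clear the flag on every exit path, including exceptions.
    struct Reset {
      std::atomic<bool>& flag;
      ~Reset() { flag.store(false); }
    } reset{updating_};
    refresh_();
    return true;
  }

 private:
  std::function<void()> refresh_;
  std::atomic<bool> updating_{false};
};
```

Since the metrics are global rather than per-statement, a caller that
gets false back loses nothing: the thread already inside the refresh
publishes the same values.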