Quanlong Huang created IMPALA-14220:
---------------------------------------

             Summary: IsActive checks blocked by the getCatalogDelta operation when there are slow DDLs
                 Key: IMPALA-14220
                 URL: https://issues.apache.org/jira/browse/IMPALA-14220
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Catalog
            Reporter: Quanlong Huang


When catalogd HA is enabled, catalogd checks whether it is the active one before serving each request, i.e. in [AcceptRequest()|https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L593]:
{code:cpp}
Status AcceptRequest(CatalogServiceVersion::type client_version) {
  ...
  } else if (FLAGS_enable_catalogd_ha && !catalog_server_->IsActive()) {
    status = Status(Substitute("Request for Catalog service is rejected since "
        "catalogd $0 is in standby mode", server_address_));
  }
{code}
This check requires holding the catalog_lock_:
{code:cpp}
bool CatalogServer::IsActive() {
  lock_guard<mutex> l(catalog_lock_);
  return is_active_;
}{code}
[https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L896]

This lock is also held by [GatherCatalogUpdatesThread|https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L905] (a.k.a. the topic update thread), which invokes the JNI method GetCatalogDelta to collect catalog updates. Collecting catalog updates is known to be blocked by slow DDLs that hold the table lock for a long time (IMPALA-6671). The topic update thread usually waits for 1 minute (configured by topic_update_tbl_max_wait_time_ms / 2) on the table lock and then skips the table with a warning like this:
{noformat}
Table tpch.lineitem (version=2373, lastSeen=2373) is skipping topic update (2387, 2388] due to lock contention{noformat}
If the table has been skipped 3 consecutive times (configured by catalog_max_lock_skipped_topic_updates), the topic update thread waits indefinitely on the table lock the next time.

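The contention described above can be reproduced in isolation. The following is a minimal sketch, not Impala code: MiniCatalogServer, IsActiveLocked, IsActiveAtomic, GatherUpdatesOnce and RunDemo are hypothetical names, and the sleep stands in for a slow GetCatalogDelta call. The atomic mirror of the flag is only an assumed way to let readers avoid catalog_lock_, not necessarily the fix adopted for this issue.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for CatalogServer; not the real Impala class.
class MiniCatalogServer {
 public:
  // Same pattern as in the report: the flag is read under catalog_lock_.
  bool IsActiveLocked() {
    std::lock_guard<std::mutex> l(catalog_lock_);
    return is_active_;
  }

  // Assumed alternative: read an atomic copy of the flag without the lock.
  bool IsActiveAtomic() const { return is_active_atomic_.load(); }

  // Simulates one round of the topic update thread: it holds catalog_lock_
  // for the whole duration of a slow catalog-delta collection.
  void GatherUpdatesOnce(std::chrono::milliseconds slow_work) {
    assert(slow_work.count() >= 0);
    std::lock_guard<std::mutex> l(catalog_lock_);
    lock_acquired_.store(true);
    std::this_thread::sleep_for(slow_work);  // stand-in for GetCatalogDelta
  }

  bool UpdaterAcquiredLock() const { return lock_acquired_.load(); }

 private:
  std::mutex catalog_lock_;
  bool is_active_ = true;                     // protected by catalog_lock_
  std::atomic<bool> is_active_atomic_{true};  // lock-free mirror (assumption)
  std::atomic<bool> lock_acquired_{false};    // set once, never cleared
};

// Runs f and returns the elapsed wall-clock time in milliseconds.
template <typename F>
long long ElapsedMs(F&& f) {
  auto start = std::chrono::steady_clock::now();
  f();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}

struct DemoResult {
  long long atomic_ms;  // time spent in the lock-free check
  long long locked_ms;  // time spent in the check that takes catalog_lock_
};

// Starts a slow "topic update" round, then times both kinds of check while
// the updater still holds the lock.
DemoResult RunDemo(std::chrono::milliseconds hold) {
  MiniCatalogServer server;
  std::thread updater([&] { server.GatherUpdatesOnce(hold); });
  // Wait until the updater really owns catalog_lock_.
  while (!server.UpdaterAcquiredLock()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  DemoResult result;
  result.atomic_ms = ElapsedMs([&] { (void)server.IsActiveAtomic(); });
  result.locked_ms = ElapsedMs([&] { (void)server.IsActiveLocked(); });
  updater.join();
  return result;
}
```

Calling RunDemo(std::chrono::milliseconds(300)) should report a locked_ms close to the hold time and an atomic_ms near zero, although exact values depend on thread scheduling. In the real catalogd the hold time is bounded by the table-lock waits described above rather than by a fixed sleep.
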
So when the topic update thread is slow in collecting one round of updates, it holds the catalog_lock_ for a long time and blocks all new requests on this catalogd. This degrades the performance of all queries that require loading metadata from catalogd.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)