[
https://issues.apache.org/jira/browse/IMPALA-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18010682#comment-18010682
]
ASF subversion and git services commented on IMPALA-14220:
----------------------------------------------------------
Commit 5fc66bfabc7e5328f024287df7ac5e4aeadb87c5 in impala's branch
refs/heads/master from Riza Suminto
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5fc66bfab ]
IMPALA-14220 (part 2): Delay AcceptRequest until catalog is stable
CatalogD availability has improved now that reading is_active_ no longer
requires holding catalog_lock_. However, during a failover, requests may
slip into the CatalogD that is transitioning from passive to active and
obtain stale metadata.
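As a rough illustration of why a lock-free read helps (a sketch only, not
the code from this patch): if is_active_ is held in a std::atomic<bool>, a
reader does not contend with a thread that holds catalog_lock_ for a long
time. The class ActiveState and its methods are hypothetical; only
is_active_ and catalog_lock_ are names taken from Impala.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

// Hypothetical stand-in for the relevant part of CatalogServer.
class ActiveState {
 public:
  // Readers never touch catalog_lock_, so a slow topic update cannot block them.
  bool IsActive() const { return is_active_.load(std::memory_order_acquire); }

  // Writers still serialize on catalog_lock_ with other state changes.
  void SetActive(bool active) {
    std::lock_guard<std::mutex> l(catalog_lock_);
    is_active_.store(active, std::memory_order_release);
  }

  // Exposed only so the usage example can simulate a long-running lock holder.
  std::mutex& catalog_lock() { return catalog_lock_; }

 private:
  std::mutex catalog_lock_;
  std::atomic<bool> is_active_{false};
};
```

With this shape, a caller can read the flag even while another thread holds
catalog_lock_, which is the contention the original issue describes.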
This patch improves the situation in two steps. First, it adds a new
mutex ha_transition_lock_ that must be obtained by AcceptRequest() in HA
mode. This mutex protects both CatalogServer::WaitPendingResetStarts() and
CatalogServer::UpdateActiveCatalogd(). WaitPendingResetStarts() will
only exit and return to AcceptRequest() after the triggered_first_reset_
flag is True (initial metadata reset has completed) or
min_catalog_resets_to_serve_ is met. If only the latter happens, the
request goes through to the Catalog JVM and is subsequently blocked by
CatalogResetManager.waitOngoingMetadataFetch() until the metadata reset has
progressed beyond the requested database/table.
Second, it increments numCatalogResetStarts_ on every global reset
(Invalidate Metadata) initiated by catalog-server.cc.
CatalogServer::MarkPendingMetadataReset() mirrors this by incrementing
min_catalog_resets_to_serve_ before setting the triggered_first_reset_
flag to False (which in turn wakes up the TriggerResetMetadata thread).
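The wait condition described in the two steps above can be sketched as
follows. This is a simplified model, not the patch itself: the class
ResetGate, the methods OnResetStarted(), OnFirstResetCompleted() and
CanServe(), and the initial values are hypothetical, and it assumes that
min_catalog_resets_to_serve_ is "met" once the number of started resets
reaches it.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical model of the readiness gate; member names follow the commit text.
class ResetGate {
 public:
  // Raise the bar first, then clear the flag, as the commit message describes.
  void MarkPendingMetadataReset() {
    std::lock_guard<std::mutex> l(lock_);
    ++min_catalog_resets_to_serve_;
    triggered_first_reset_ = false;
    cv_.notify_all();
  }

  // Assumed hook: called whenever a global reset (Invalidate Metadata) starts.
  void OnResetStarted() {
    std::lock_guard<std::mutex> l(lock_);
    ++num_catalog_reset_starts_;
    cv_.notify_all();
  }

  // Assumed hook: called when the initial metadata reset has completed.
  void OnFirstResetCompleted() {
    std::lock_guard<std::mutex> l(lock_);
    triggered_first_reset_ = true;
    cv_.notify_all();
  }

  // Blocks until either exit condition from the commit message holds.
  void WaitPendingResetStarts() {
    std::unique_lock<std::mutex> l(lock_);
    cv_.wait(l, [this] { return CanServeLocked(); });
  }

  // Non-blocking view of the same predicate, for inspection.
  bool CanServe() {
    std::lock_guard<std::mutex> l(lock_);
    return CanServeLocked();
  }

 private:
  bool CanServeLocked() const {
    return triggered_first_reset_ ||
           num_catalog_reset_starts_ >= min_catalog_resets_to_serve_;
  }

  std::mutex lock_;
  std::condition_variable cv_;
  bool triggered_first_reset_ = false;
  std::int64_t num_catalog_reset_starts_ = 0;
  std::int64_t min_catalog_resets_to_serve_ = 0;
};
```

In this model a request arriving after MarkPendingMetadataReset() waits
until a reset has at least started; any further blocking until the reset
passes the requested database/table would happen on the JVM side.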
Rename WaitForCatalogReady() to
WaitCatalogReadinessForWorkloadManagement() since this wait mechanism is
specific to Workload Management initialization and has stricter
requirements.
Remove CatalogServer::IsActive() since its only call site is replaced
with CatalogServer::WaitHATransition().
Testing:
Added test_metadata_after_failover_with_delayed_reset and
test_metadata_after_failover_with_hms_sync.
Change-Id: I370d21319335318e441ec3c3455bac4227803900
Reviewed-on: http://gerrit.cloudera.org:8080/23194
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> IsActive checks blocked by the getCatalogDelta operation when there are slow
> DDLs
> ---------------------------------------------------------------------------------
>
> Key: IMPALA-14220
> URL: https://issues.apache.org/jira/browse/IMPALA-14220
> Project: IMPALA
> Issue Type: Bug
> Components: Backend, Catalog
> Reporter: Quanlong Huang
> Assignee: Riza Suminto
> Priority: Blocker
> Fix For: Impala 5.0.0
>
>
> When catalogd HA is enabled, catalogd will check whether it's the active one
> before serving each request, i.e. in
> [AcceptRequest()|https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L593]:
> {code:cpp}
> Status AcceptRequest(CatalogServiceVersion::type client_version) {
>   ...
>   } else if (FLAGS_enable_catalogd_ha && !catalog_server_->IsActive()) {
>     status = Status(Substitute("Request for Catalog service is rejected since "
>         "catalogd $0 is in standby mode", server_address_));
>   }
> {code}
> This check requires holding the catalog_lock_:
> {code:cpp}
> bool CatalogServer::IsActive() {
> lock_guard<mutex> l(catalog_lock_);
> return is_active_;
> }{code}
> [https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L896]
> This lock is also held by
> [GatherCatalogUpdatesThread|https://github.com/apache/impala/blob/8d56eea72518aa11a36aa086dc8961bc8cdbd1fd/be/src/catalog/catalog-server.cc#L905]
> (a.k.a. topic update thread) which invokes JNI method GetCatalogDelta to
> collect catalog updates.
> It's known that collecting catalog updates could be blocked by slow DDLs that
> hold the table lock for a long time (IMPALA-6671). The topic update thread
> usually waits for 1 minute (configured by topic_update_tbl_max_wait_time_ms /
> 2) on the table lock and then skips it with a warning like this:
> {noformat}
> Table tpch.lineitem (version=2373, lastSeen=2373) is skipping topic update
> (2387, 2388] due to lock contention{noformat}
> If the table hasn't been collected for 3 consecutive rounds (configured by
> catalog_max_lock_skipped_topic_updates), the topic update thread will wait
> indefinitely on it the next time.
> So when the topic update thread is slow in collecting one round of updates,
> it holds the catalog_lock_ for a long time and blocks all new requests on
> this catalogd. This impacts performance on all queries that require loading
> metadata from catalogd.