[email protected] has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/23823 )

Change subject: IMPALA-14618: Ensure processor clear waits for active event 
processing
......................................................................


Patch Set 2:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@12
PS1, Line 12: by those processors.
> Do we need to clear them concurrently if the clear takes time in waiting?
DbProcessors in a DbEventExecutor share single thread. So only one of the 
DbProcessor might wait in a DbEventExecutor thread.

And, we use parallelStream to clear the DbEventExecutors concurrently.


http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@14
PS1, Line 14: it could update
            : the cache with stale information after the metadata rebuild
> It seems another kind of bug if the rebuilt cache can be updated by a stale
Below is the sequence of operations from logs:

1. test_failover_catchup_timeout_and_reset test has 
"debug_actions=catalogd_event_processing_delay:SLEEP@3000" property that cause 
event processing time to be more than 3sec. So sleep for 3 seconds in 
CREATE_DATABASE event(EventId: 61210) processing.

Logs:
I20251207 11:45:44.909482 2338302 MetastoreEvents.java:310] Total number of 
events received: 1 Total number of events filtered out: 0
I20251207 11:45:44.911422 2338302 DbEventExecutor.java:716] Assigned executor: 
DbEventExecutor-0 for db: test_failover_catchup_timeout_and_reset_5d02c505
I20251207 11:45:44.911527 2338302 MetastoreEvents.java:947] EventId: 61210 
EventType: CREATE_DATABASE Target: 
test_failover_catchup_timeout_and_reset_5d02c505. Enqueued for db: 
test_failover_catchup_timeout_and_reset_5d02c505 on executor: DbEventExecutor-0
I20251207 11:45:44.911571 2338302 MetastoreEvents.java:947] EventId: 61210 
EventType: CREATE_DATABASE Target: 
test_failover_catchup_timeout_and_reset_5d02c505. Dispatch time: 104ms
I20251207 11:45:44.911819 2338302 MetastoreEventsProcessor.java:1708] Time 
elapsed in dispatching event batch: 105.566ms
I20251207 11:45:44.914646 2338234 MetastoreEvents.java:947] EventId: 61210 
EventType: CREATE_DATABASE Target: 
test_failover_catchup_timeout_and_reset_5d02c505. Creating event 61210 of type 
CREATE_DATABASE on database test_failover_catchup_timeout_and_reset_5d02c505
I20251207 11:45:44.915838 2338234 DebugUtils.java:188] Sleeping for 3000 msec 
to execute debug action catalogd_event_processing_delay
I20251207 11:45:45.305532 2338303 MetastoreEventsProcessor.java:1454] Latest 
event in HMS: id=61211, time=1765136745. Greatest synced event: id=61209, 
time=1765136730.
W20251207 11:45:45.305663 2338303 MetastoreEventsProcessor.java:1459] Lag: 15s. 
Approximately 2 events pending to be processed.
I20251207 11:45:45.305747 2338302 MetastoreEventsProcessor.java:1288] Received 
1 events. First event id: 61211.
I20251207 11:45:45.311097 2338302 MetastoreEvents.java:310] Total number of 
events received: 1 Total number of events filtered out: 0
I20251207 11:45:45.311270 2338302 MetastoreEvents.java:947] EventId: 61211 
EventType: CREATE_TABLE Target: 
test_failover_catchup_timeout_and_reset_5d02c505.tbl. Enqueued for db: 
test_failover_catchup_timeout_and_reset_5d02c505 on executor: DbEventExecutor-0
I20251207 11:45:45.311333 2338302 MetastoreEvents.java:947] EventId: 61211 
EventType: CREATE_TABLE Target: 
test_failover_catchup_timeout_and_reset_5d02c505.tbl. Dispatch time: 5ms
I20251207 11:45:45.311496 2338302 MetastoreEventsProcessor.java:1708] Time 
elapsed in dispatching event batch: 5.580ms
I20251207
-----------------------

2. After catch timeout(catalogd_ha_failover_catchup_timeout_s=2), reset 
happened.

Logs:
I20251207 11:45:47.891970 2338157 JniUtil.java:167] getEventProcessorSummary 
request: Getting event processor summary
I20251207 11:45:47.892719 2338157 JniUtil.java:178] Finished 
getEventProcessorSummary request: Getting event processor summary. Time spent: 
1ms
W20251207 11:45:47.892760 2338157 catalog-server.cc:968] Timed out waiting for 
catching up HMS events.
I20251207 11:45:47.892794 2338157 catalog-server.cc:989] Marking pending 
metadata reset. Min number of catalog resets to serve: 2
I20251207 11:45:47.892799 2338157 catalog-server.cc:971] Fallback to resetting 
all metadata.
I20251207 11:45:47.892819 2338157 catalog-server.cc:886] This catalogd instance 
is changed to active status
-----------------------

3. Reset metadata on catalogd, cleared event executors and acquired catalog 
version write lock and started with cache rebuild and loaded table list for 
test_failover_catchup_timeout_and_reset_5d02c505. catalog version write lock is 
acquired before rebuild at:
https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L2503

Note: Still earlier CREATE_DATABASE event(EventId: 61210) processing is in 
sleep.

Logs:
I20251207 11:45:47.917454 2338149 Db.java:535] Setting the catalog version of 
Db@6345b9bd test_failover_catchup_timeout_and_reset_5d02c505 to 3916
I20251207 11:45:47.917505 2338149 CatalogServiceCatalog.java:2398] Loaded table 
list (1) and functions (0) for database: 
test_failover_catchup_timeout_and_reset_5d02c505. Fetch duration: 2 ms.
-----------------------

4. CREATE_DATABASE event(EventId: 61210) sleep finished but waiting for version 
write lock that was held by reset thread.
https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L2706
-----------------------

5. Once the reset thread released the lock, addDb() proceeded further and 
overwritten db object in dbCache_.

Logs:
I20251207 11:45:47.921928 2338234 Db.java:535] Setting the catalog version of 
Db@1c257df7 test_failover_catchup_timeout_and_reset_5d02c505 to 4212
I20251207 11:45:47.921983 2338149 CatalogServiceCatalog.java:2553] Invalidated 
all metadata in 28 ms (2 ms outside version write lock).
I20251207 11:45:47.922029 2338234 Db.java:147] createEventId_ for db: 
test_failover_catchup_timeout_and_reset_5d02c505 set to: 61210
I20251207 11:45:47.922089 2338234 Db.java:159] lastSyncedEventId_ for db: 
test_failover_catchup_timeout_and_reset_5d02c505 set from -1 to 61210
I20251207 11:45:47.922142 2338149 JniUtil.java:178] Finished resetMetadata 
request: INVALIDATE ALL issued by null. Time spent: 30ms
I20251207 11:45:47.922425 2338234 MetastoreEvents.java:936] EventId: 61210 
EventType: CREATE_DATABASE Target: 
test_failover_catchup_timeout_and_reset_5d02c505. Successfully added database 
test_failover_catchup_timeout_and_reset_5d02c505
I20251207 11:45:47.922520 2338234 MetastoreEvents.java:936] EventId: 61210 
EventType: CREATE_DATABASE Target: 
test_failover_catchup_timeout_and_reset_5d02c505. Scheduling delay: 4ms, 
Process time: 3s007ms


http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@18
PS1, Line 18: This change ensures that processor clear waits for any ongoing 
event
            : processing to complete before clearing internal state.
> Single-table ALTER_TABLE events, i.e. non-rename events, might take a long
IMO, we need to wait for event processing to completely pause before proceeding 
with reset. I mean ensuring none of the executor(db/table) threads are still 
processing any event.
We actually do not wait for all pending events to be processed. Just wait for 
the on-going event(if it is being processed currently) to complete.


http://gerrit.cloudera.org:8080/#/c/23823/1/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java
File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java:

http://gerrit.cloudera.org:8080/#/c/23823/1/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@2491
PS1, Line 2491:     metastoreEventProcessor_.clear();
> If this could take long, we need to mark it in 'catalogTimeline' when it fi
Done



--
To view, visit http://gerrit.cloudera.org:8080/23823
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ib04ee2fa425cb4be397232a0065b6fdbfc1e244c
Gerrit-Change-Number: 23823
Gerrit-PatchSet: 2
Gerrit-Owner: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Quanlong Huang <[email protected]>
Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]>
Gerrit-Comment-Date: Fri, 09 Jan 2026 09:40:08 +0000
Gerrit-HasComments: Yes

Reply via email to