[email protected] has posted comments on this change. ( http://gerrit.cloudera.org:8080/23823 )
Change subject: IMPALA-14618: Ensure processor clear waits for active event processing ...................................................................... Patch Set 2: (4 comments) http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@12 PS1, Line 12: by those processors. > Do we need to clear them concurrently if the clear takes time in waiting? DbProcessors in a DbEventExecutor share single thread. So only one of the DbProcessor might wait in a DbEventExecutor thread. And, we use parallelStream to clear the DbEventExecutors concurrently. http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@14 PS1, Line 14: it could update : the cache with stale information after the metadata rebuild > It seems another kind of bug if the rebuilt cache can be updated by a stale Below is the sequence of operations from logs: 1. test_failover_catchup_timeout_and_reset test has "debug_actions=catalogd_event_processing_delay:SLEEP@3000" property that cause event processing time to be more than 3sec. So sleep for 3 seconds in CREATE_DATABASE event(EventId: 61210) processing. Logs: I20251207 11:45:44.909482 2338302 MetastoreEvents.java:310] Total number of events received: 1 Total number of events filtered out: 0 I20251207 11:45:44.911422 2338302 DbEventExecutor.java:716] Assigned executor: DbEventExecutor-0 for db: test_failover_catchup_timeout_and_reset_5d02c505 I20251207 11:45:44.911527 2338302 MetastoreEvents.java:947] EventId: 61210 EventType: CREATE_DATABASE Target: test_failover_catchup_timeout_and_reset_5d02c505. Enqueued for db: test_failover_catchup_timeout_and_reset_5d02c505 on executor: DbEventExecutor-0 I20251207 11:45:44.911571 2338302 MetastoreEvents.java:947] EventId: 61210 EventType: CREATE_DATABASE Target: test_failover_catchup_timeout_and_reset_5d02c505. Dispatch time: 104ms I20251207 11:45:44.911819 2338302 MetastoreEventsProcessor.java:1708] Time elapsed in dispatching event batch: 105.566ms I20251207 11:45:44.914646 2338234 MetastoreEvents.java:947] EventId: 61210 EventType: CREATE_DATABASE Target: test_failover_catchup_timeout_and_reset_5d02c505. Creating event 61210 of type CREATE_DATABASE on database test_failover_catchup_timeout_and_reset_5d02c505 I20251207 11:45:44.915838 2338234 DebugUtils.java:188] Sleeping for 3000 msec to execute debug action catalogd_event_processing_delay I20251207 11:45:45.305532 2338303 MetastoreEventsProcessor.java:1454] Latest event in HMS: id=61211, time=1765136745. Greatest synced event: id=61209, time=1765136730. W20251207 11:45:45.305663 2338303 MetastoreEventsProcessor.java:1459] Lag: 15s. Approximately 2 events pending to be processed. I20251207 11:45:45.305747 2338302 MetastoreEventsProcessor.java:1288] Received 1 events. First event id: 61211. I20251207 11:45:45.311097 2338302 MetastoreEvents.java:310] Total number of events received: 1 Total number of events filtered out: 0 I20251207 11:45:45.311270 2338302 MetastoreEvents.java:947] EventId: 61211 EventType: CREATE_TABLE Target: test_failover_catchup_timeout_and_reset_5d02c505.tbl. Enqueued for db: test_failover_catchup_timeout_and_reset_5d02c505 on executor: DbEventExecutor-0 I20251207 11:45:45.311333 2338302 MetastoreEvents.java:947] EventId: 61211 EventType: CREATE_TABLE Target: test_failover_catchup_timeout_and_reset_5d02c505.tbl. Dispatch time: 5ms I20251207 11:45:45.311496 2338302 MetastoreEventsProcessor.java:1708] Time elapsed in dispatching event batch: 5.580ms I20251207 ----------------------- 2. After catch timeout(catalogd_ha_failover_catchup_timeout_s=2), reset happened. Logs: I20251207 11:45:47.891970 2338157 JniUtil.java:167] getEventProcessorSummary request: Getting event processor summary I20251207 11:45:47.892719 2338157 JniUtil.java:178] Finished getEventProcessorSummary request: Getting event processor summary. Time spent: 1ms W20251207 11:45:47.892760 2338157 catalog-server.cc:968] Timed out waiting for catching up HMS events. I20251207 11:45:47.892794 2338157 catalog-server.cc:989] Marking pending metadata reset. Min number of catalog resets to serve: 2 I20251207 11:45:47.892799 2338157 catalog-server.cc:971] Fallback to resetting all metadata. I20251207 11:45:47.892819 2338157 catalog-server.cc:886] This catalogd instance is changed to active status ----------------------- 3. Reset metadata on catalogd, cleared event executors and acquired catalog version write lock and started with cache rebuild and loaded table list for test_failover_catchup_timeout_and_reset_5d02c505. catalog version write lock is acquired before rebuild at: https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L2503 Note: Still earlier CREATE_DATABASE event(EventId: 61210) processing is in sleep. Logs: I20251207 11:45:47.917454 2338149 Db.java:535] Setting the catalog version of Db@6345b9bd test_failover_catchup_timeout_and_reset_5d02c505 to 3916 I20251207 11:45:47.917505 2338149 CatalogServiceCatalog.java:2398] Loaded table list (1) and functions (0) for database: test_failover_catchup_timeout_and_reset_5d02c505. Fetch duration: 2 ms. ----------------------- 4. CREATE_DATABASE event(EventId: 61210) sleep finished but waiting for version write lock that was held by reset thread. https://github.com/apache/impala/blob/45051a27672c3a8a5be304ab2c71fc5c02946788/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java#L2706 ----------------------- 5. Once the reset thread released the lock, addDb() proceeded further and overwritten db object in dbCache_. Logs: I20251207 11:45:47.921928 2338234 Db.java:535] Setting the catalog version of Db@1c257df7 test_failover_catchup_timeout_and_reset_5d02c505 to 4212 I20251207 11:45:47.921983 2338149 CatalogServiceCatalog.java:2553] Invalidated all metadata in 28 ms (2 ms outside version write lock). I20251207 11:45:47.922029 2338234 Db.java:147] createEventId_ for db: test_failover_catchup_timeout_and_reset_5d02c505 set to: 61210 I20251207 11:45:47.922089 2338234 Db.java:159] lastSyncedEventId_ for db: test_failover_catchup_timeout_and_reset_5d02c505 set from -1 to 61210 I20251207 11:45:47.922142 2338149 JniUtil.java:178] Finished resetMetadata request: INVALIDATE ALL issued by null. Time spent: 30ms I20251207 11:45:47.922425 2338234 MetastoreEvents.java:936] EventId: 61210 EventType: CREATE_DATABASE Target: test_failover_catchup_timeout_and_reset_5d02c505. Successfully added database test_failover_catchup_timeout_and_reset_5d02c505 I20251207 11:45:47.922520 2338234 MetastoreEvents.java:936] EventId: 61210 EventType: CREATE_DATABASE Target: test_failover_catchup_timeout_and_reset_5d02c505. Scheduling delay: 4ms, Process time: 3s007ms http://gerrit.cloudera.org:8080/#/c/23823/1//COMMIT_MSG@18 PS1, Line 18: This change ensures that processor clear waits for any ongoing event : processing to complete before clearing internal state. > Single-table ALTER_TABLE events, i.e. non-rename events, might take a long IMO, we need to wait for event processing to completely pause before proceeding with reset. I mean ensuring none of the executor(db/table) threads are still processing any event. We actually do not wait for all pending events to be processed. Just wait for the on-going event(if it is being processed currently) to complete. http://gerrit.cloudera.org:8080/#/c/23823/1/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java File fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java: http://gerrit.cloudera.org:8080/#/c/23823/1/fe/src/main/java/org/apache/impala/catalog/CatalogServiceCatalog.java@2491 PS1, Line 2491: metastoreEventProcessor_.clear(); > If this could take long, we need to mark it in 'catalogTimeline' when it fi Done -- To view, visit http://gerrit.cloudera.org:8080/23823 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ib04ee2fa425cb4be397232a0065b6fdbfc1e244c Gerrit-Change-Number: 23823 Gerrit-PatchSet: 2 Gerrit-Owner: Anonymous Coward <[email protected]> Gerrit-Reviewer: Anonymous Coward <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Quanlong Huang <[email protected]> Gerrit-Reviewer: Sai Hemanth Gantasala <[email protected]> Gerrit-Comment-Date: Fri, 09 Jan 2026 09:40:08 +0000 Gerrit-HasComments: Yes
