[ https://issues.apache.org/jira/browse/IMPALA-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Quanlong Huang resolved IMPALA-14283. ------------------------------------- Fix Version/s: Impala 5.0.0 Resolution: Fixed > Coordinator could use stale metadata in catalogd warm failover > -------------------------------------------------------------- > > Key: IMPALA-14283 > URL: https://issues.apache.org/jira/browse/IMPALA-14283 > Project: IMPALA > Issue Type: Bug > Components: Catalog > Reporter: Quanlong Huang > Assignee: Quanlong Huang > Priority: Critical > Fix For: Impala 5.0.0 > > Attachments: catalogd.INFO, catalogd_node1.INFO, impalad.INFO, > show_parts_profile.txt, statestored.INFO > > > There is a corner case in local catalog mode that when catalogd warm failover > finishes, coordinator fetches the new table object from the new active > catalogd and it might coincidently "activate" the stale metadata in the > cache. This happens when matching all the following conditions: > * The new table object has a catalog version that has been used in the > previous active catalogd and it was assigned to the same table name. > * The coordinator has loaded table metadata associated to that version from > the previous active catalogd, e.g. partition list, partitions or column stats > are loaded and still in the cache (not aged out or evicted yet). > * The initial catalog update of the new active catalogd hasn't arrived yet. > There is some delay when statestore broadcast the updates. If the statestore > update arrives before the query planning, we are safe since processing the > update will clear the cache. > This bug is found by *TestCatalogdHA.test_warmed_up_metadata_failover_catchup* > {code:python} > tests/custom_cluster/test_catalogd_ha.py:640: in > test_warmed_up_metadata_failover_catchup > > > db, self._add_new_partition, self._verify_new_partition) > > > tests/custom_cluster/test_catalogd_ha.py:762: in > _test_metadata_after_failover > > > assert str(active_catalogd.service.service_port) in str(e) > > > E assert '26001' in "assert 'year=2025/month=1' in > 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'" > > > E + where '26001' = str(26001) > > > E + where 26001 = <tests.common.impala_service.CatalogdService object > at 0x7f5d8c24fd10>.service_port > > E + where <tests.common.impala_service.CatalogdService object at > 0x7f5d8c24fd10> = <CatalogdProcess PID: None > (/root/Impala-upstream/be/build/latest/service/cata...te_store_subscriber_port=23021 > -webserver_ > port=25021 -enable_catalogd_ha=true)>.service > > > E + and "assert 'year=2025/month=1' in > 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'" = str(AssertionError(u"assert > 'year=2025/month=1' in 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'",)){code} > The test runs an AddPartition statement and kills the active catalogd, then > runs ShowPartitions in the new active catalogd. It fails since coordinator > returns an empty partition list. Test logs: > {noformat} > -- 2025-07-31 17:10:00,755 INFO MainThread: hs2: executing against Impala > at quanlong-ha-test.vpc.cloudera.com:21050. session: > ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None > > > > > alter table warmup_test_db.tbl add partition(year=2025, month=1); > > > > > > -- 2025-07-31 17:10:01,119 INFO MainThread: > b84a8a27014a3340:af77289200000000: query started > > > -- 2025-07-31 17:10:01,120 INFO MainThread: > b84a8a27014a3340:af77289200000000: getting log for operation > > > -- 2025-07-31 17:10:01,120 INFO MainThread: > b84a8a27014a3340:af77289200000000: getting runtime profile operation > > > -- 2025-07-31 17:10:01,120 INFO MainThread: > b84a8a27014a3340:af77289200000000: closing query for operation > > > -- 2025-07-31 17:10:01,147 INFO MainThread: Found PID 102378 for > /root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5 -v=1 > -max_log_files=10 -log_rotation_match_pid=true -log_filename=catalogd_n > ode1 -log_dir=/tmp/ -kudu_master_hosts localhost --catalog_topic_mode=minimal > --catalogd_ha_reset_metadata_on_failover=false > --debug_actions=catalogd_event_processing_delay:SLEEP@1000 > --enable_reload_events=true -- > warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt > -catalog_service_port=26001 -state_store_subscriber_port=23021 > -webserver_port=25021 -enable_catalogd_ha=true > -- 2025-07-31 17:10:01,172 INFO MainThread: Killing <CatalogdProcess PID: > 102378 (/root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5 > -v=1 -max_log_files=10 -log_rotation_match_pid=true -log_fi > lename=catalogd_node1 -log_dir=/tmp/ -kudu_master_hosts localhost > --catalog_topic_mode=minimal --catalogd_ha_reset_metadata_on_failover=false > --debug_actions=catalogd_event_processing_delay:SLEEP@1000 --enable_relo > ad_events=true > --warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt > -catalog_service_port=26001 -state_store_subscriber_port=23021 > -webserver_port=25021 -enable_catalogd_ha= > true)> with signal 9 > > > -- 2025-07-31 17:10:01,177 INFO MainThread: Getting metric: > catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020 > > -- 2025-07-31 17:10:01,180 INFO MainThread: Waiting for metric value > 'catalog-server.active-status'=True. Current value: False. total_wait: 0s > -- 2025-07-31 17:10:01,180 INFO MainThread: Sleeping 1s before next > retry. > > -- 2025-07-31 17:10:02,181 INFO MainThread: Getting metric: > catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020 > > -- 2025-07-31 17:10:02,184 INFO MainThread: Metric > 'catalog-server.active-status' has reached desired value: True. total_wait: > 1.00396800041s > -- 2025-07-31 17:10:02,188 INFO MainThread: hs2: executing against Impala > at quanlong-ha-test.vpc.cloudera.com:21050. session: > ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None > > > > show partitions warmup_test_db.tbl; > > > > -- 2025-07-31 17:10:03,737 INFO MainThread: > 664e4d97f4712dfb:e4d0877500000000: query started > -- 2025-07-31 17:10:03,738 INFO MainThread: > 664e4d97f4712dfb:e4d0877500000000: getting log for operation > -- 2025-07-31 17:10:03,738 INFO MainThread: > 664e4d97f4712dfb:e4d0877500000000: getting runtime profile operation > -- 2025-07-31 17:10:03,738 INFO MainThread: > 664e4d97f4712dfb:e4d0877500000000: closing query for operation > -- 2025-07-31 17:10:03,739 INFO MainThread: partition result: > ['Total\t\t-1\t0\t0B\t0B\t\t\t\t\t']{noformat} > Profile of ShowPartitions indicates some stale metadata are used: > {noformat} > Original Table Versions: warmup_test_db.tbl, 314, 1754007002807, Thu > Jul 31 17:10:02 PDT 2025 > - CatalogFetch.ColumnStats.Hits: 13 <--- This shouldn't hit > - CatalogFetch.ColumnStats.Misses: 0 > - CatalogFetch.ColumnStats.Requests: 13 > - CatalogFetch.ColumnStats.Time: 0 > - CatalogFetch.DatabaseList.Hits: 1 > - CatalogFetch.DatabaseList.Misses: 0 > - CatalogFetch.DatabaseList.Requests: 1 > - CatalogFetch.DatabaseList.Time: 0 > - CatalogFetch.PartitionLists.Hits: 1 <--- This shouldn't hit > - CatalogFetch.PartitionLists.Misses: 0 > - CatalogFetch.PartitionLists.Requests: 1 > - CatalogFetch.PartitionLists.Time: 0 > - CatalogFetch.RPCs.Bytes: 1.78 KB (1818) > - CatalogFetch.RPCs.Requests: 2 > - CatalogFetch.RPCs.Time: 1s531ms > - CatalogFetch.StorageLoad.Time: 22ms > - CatalogFetch.TableList.Hits: 1 > - CatalogFetch.TableList.Misses: 1 > - CatalogFetch.TableList.Requests: 2 > - CatalogFetch.TableList.Time: 1s516ms > - CatalogFetch.Tables.Hits: 0 > - CatalogFetch.Tables.Misses: 1 > - CatalogFetch.Tables.Requests: 1 > - CatalogFetch.Tables.Time: 15ms > - ExecutorGroupsConsidered: 1 (1){noformat} > In logs of the new active catalogd (catalogd.INFO), the catalog version of > the table is 314: > {noformat} > I20250731 17:10:02.760269 103771 CatalogOpExecutor.java:4872] Found 1/1 > partitions to add in table warmup_test_db.tbl from event 888 > I20250731 17:10:02.807034 103771 HdfsTable.java:781] Loaded file and block > metadata for warmup_test_db.tbl partitions: year=2025/month=1. Time taken: > 27.856ms > I20250731 17:10:02.807621 103771 MetastoreEvents.java:860] EventId: 888 > EventType: ADD_PARTITION Target: warmup_test_db.tbl. Successfully added 1 > partitions to table warmup_test_db.tbl > I20250731 17:10:02.807871 103771 MetastoreEventsProcessor.java:1578] Time > elapsed in processing event batch: 1s156ms > I20250731 17:10:03.673661 103802 JniUtil.java:167] getEventProcessorSummary > request: Getting event processor summary > I20250731 17:10:03.674260 103802 JniUtil.java:178] Finished > getEventProcessorSummary request: Getting event processor summary. Time > spent: 1ms > I20250731 17:10:03.674295 103802 catalog-server.cc:978] EventProcessor is > synced up with HMS event id during failover: 888 > I20250731 17:10:03.674309 103802 catalog-server.cc:887] This catalogd > instance is changed to active status > I20250731 17:10:03.677537 103773 JniUtil.java:167] getCatalogDelta request: > Getting catalog delta from version 0 > I20250731 17:10:03.687700 103773 catalog-server.cc:1615] Collected update: > 2:DATABASE:default, version=2, original size=48, compressed size=49 > ... > I20250731 17:10:03.720110 103773 CatalogServiceCatalog.java:1965] Skipped 0 > partitions of table warmup_test_db.tbl in the incremental update > I20250731 17:10:03.720285 103773 CatalogServiceCatalog.java:1978] Collected 1 > partition update(s): 2:HDFS_PARTITION:warmup_test_db.tbl:year=2025/month=1, > version=314, original size=100, compressed size=101 > I20250731 17:10:03.720324 103773 catalog-server.cc:1615] Collected update: > 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed > size=66{noformat} > In the previous catalogd logs (catalogd_node1.INFO), the table version before > AddPartition is also 314! > {noformat} > I20250731 17:09:58.553951 103647 HdfsTable.java:1256] Loading metadata for > table definition and all partition(s) of warmup_test_db.tbl (needed by > coordinator) > I20250731 17:09:58.600651 103647 HdfsTable.java:1919] Loaded 13 columns from > HMS. Actual columns: 13 > I20250731 17:09:58.623989 103647 HdfsTable.java:2987] Load Valid Write Id > List Done. Time taken: 24.253us > I20250731 17:09:58.624092 103647 HdfsTable.java:1303] Fetching partition > metadata from the Metastore: warmup_test_db.tbl > I20250731 17:09:58.628580 103647 MetaStoreUtil.java:191] Fetching 0 > partitions for: warmup_test_db.tbl using partition batch size: 1000 > I20250731 17:09:58.628685 103647 HdfsTable.java:1310] Fetched partition > metadata from the Metastore: warmup_test_db.tbl > I20250731 17:09:58.649619 103647 HdfsTable.java:781] Loaded file and block > metadata for warmup_test_db.tbl partitions: . Time taken: 3.319ms > I20250731 17:09:58.651077 103647 Table.java:1106] last refreshed event id for > table: warmup_test_db.tbl set to: -1 > I20250731 17:09:58.651185 103647 TableLoader.java:187] Loaded metadata for: > warmup_test_db.tbl (129ms) > I20250731 17:09:58.685761 102954 JniUtil.java:167] getCatalogDelta request: > Getting catalog delta from version 313 > I20250731 17:09:58.693717 102954 CatalogServiceCatalog.java:1965] Skipped 0 > partitions of table warmup_test_db.tbl in the incremental update > I20250731 17:09:58.693792 102954 catalog-server.cc:1615] Collected update: > 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed > size=66{noformat} > So coordinator has loaded the empty partition list of it in cache. It's > associated with table name "warmup_test_db.tbl" and catalog version 314. Then > coincidently reused after catalogd warm failover. > We should add the catalog service id in the getPartialCatalogObject response > and coordinator should clear the cache when the id changes, just like what we > do in updateCatalogCache() -> > [witnessCatalogServiceId()|https://github.com/apache/impala/blob/5414c3009890a736901b6ae3b56b9953b1992de7/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#L1614]. -- This message was sent by Atlassian Jira (v8.20.10#820010)