[ 
https://issues.apache.org/jira/browse/IMPALA-14283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-14283.
-------------------------------------
    Fix Version/s: Impala 5.0.0
       Resolution: Fixed

> Coordinator could use stale metadata in catalogd warm failover
> --------------------------------------------------------------
>
>                 Key: IMPALA-14283
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14283
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>             Fix For: Impala 5.0.0
>
>         Attachments: catalogd.INFO, catalogd_node1.INFO, impalad.INFO, 
> show_parts_profile.txt, statestored.INFO
>
>
> There is a corner case in local catalog mode that when catalogd warm failover 
> finishes, coordinator fetches the new table object from the new active 
> catalogd and it might coincidently "activate" the stale metadata in the 
> cache. This happens when matching all the following conditions:
>  * The new table object has a catalog version that has been used in the 
> previous active catalogd and it was assigned to the same table name.
>  * The coordinator has loaded table metadata associated to that version from 
> the previous active catalogd, e.g. partition list, partitions or column stats 
> are loaded and still in the cache (not aged out or evicted yet).
>  * The initial catalog update of the new active catalogd hasn't arrived yet. 
> There is some delay when statestore broadcast the updates. If the statestore 
> update arrives before the query planning, we are safe since processing the 
> update will clear the cache.
> This bug is found by *TestCatalogdHA.test_warmed_up_metadata_failover_catchup*
> {code:python}
> tests/custom_cluster/test_catalogd_ha.py:640: in 
> test_warmed_up_metadata_failover_catchup                                      
>                                                                               
>          
>     db, self._add_new_partition, self._verify_new_partition)                  
>                                                                               
>                                                           
> tests/custom_cluster/test_catalogd_ha.py:762: in 
> _test_metadata_after_failover                                                 
>                                                                               
>          
>     assert str(active_catalogd.service.service_port) in str(e)                
>                                                                               
>                                                           
> E   assert '26001' in "assert 'year=2025/month=1' in 
> 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'"                                 
>                                                                               
>      
> E    +  where '26001' = str(26001)                                            
>                                                                               
>                                                           
> E    +    where 26001 = <tests.common.impala_service.CatalogdService object 
> at 0x7f5d8c24fd10>.service_port                                               
>                                                             
> E    +      where <tests.common.impala_service.CatalogdService object at 
> 0x7f5d8c24fd10> = <CatalogdProcess PID: None 
> (/root/Impala-upstream/be/build/latest/service/cata...te_store_subscriber_port=23021
>  -webserver_
> port=25021 -enable_catalogd_ha=true)>.service                                 
>                                                                               
>                                                           
> E    +  and   "assert 'year=2025/month=1' in 
> 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'" = str(AssertionError(u"assert 
> 'year=2025/month=1' in 'Total\\t\\t-1\\t0\\t0B\\t0B\\t\\t\\t\\t\\t'",)){code}
> The test runs an AddPartition statement and kills the active catalogd, then 
> runs ShowPartitions in the new active catalogd. It fails since coordinator 
> returns an empty partition list. Test logs:
> {noformat}
> -- 2025-07-31 17:10:00,755 INFO     MainThread: hs2: executing against Impala 
> at quanlong-ha-test.vpc.cloudera.com:21050. session: 
> ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None                
>      
>                                                                               
>                                                                               
>                                                           
> alter table warmup_test_db.tbl add partition(year=2025, month=1);             
>                                                                               
>                                                           
>                                                                               
>                                                                               
>                                                           
> -- 2025-07-31 17:10:01,119 INFO     MainThread: 
> b84a8a27014a3340:af77289200000000: query started                              
>                                                                               
>           
> -- 2025-07-31 17:10:01,120 INFO     MainThread: 
> b84a8a27014a3340:af77289200000000: getting log for operation                  
>                                                                               
>           
> -- 2025-07-31 17:10:01,120 INFO     MainThread: 
> b84a8a27014a3340:af77289200000000: getting runtime profile operation          
>                                                                               
>           
> -- 2025-07-31 17:10:01,120 INFO     MainThread: 
> b84a8a27014a3340:af77289200000000: closing query for operation                
>                                                                               
>           
> -- 2025-07-31 17:10:01,147 INFO     MainThread: Found PID 102378 for 
> /root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5 -v=1 
> -max_log_files=10 -log_rotation_match_pid=true -log_filename=catalogd_n
> ode1 -log_dir=/tmp/ -kudu_master_hosts localhost --catalog_topic_mode=minimal 
> --catalogd_ha_reset_metadata_on_failover=false 
> --debug_actions=catalogd_event_processing_delay:SLEEP@1000 
> --enable_reload_events=true --
> warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt
>  -catalog_service_port=26001 -state_store_subscriber_port=23021 
> -webserver_port=25021 -enable_catalogd_ha=true             
> -- 2025-07-31 17:10:01,172 INFO     MainThread: Killing <CatalogdProcess PID: 
> 102378 (/root/Impala-upstream/be/build/latest/service/catalogd -logbufsecs=5 
> -v=1 -max_log_files=10 -log_rotation_match_pid=true -log_fi
> lename=catalogd_node1 -log_dir=/tmp/ -kudu_master_hosts localhost 
> --catalog_topic_mode=minimal --catalogd_ha_reset_metadata_on_failover=false 
> --debug_actions=catalogd_event_processing_delay:SLEEP@1000 --enable_relo
> ad_events=true 
> --warmup_tables_config_file=file:///root/Impala-upstream/testdata/data/warmup_test_config.txt
>  -catalog_service_port=26001 -state_store_subscriber_port=23021 
> -webserver_port=25021 -enable_catalogd_ha=
> true)> with signal 9                                                          
>                                                                               
>                                                           
> -- 2025-07-31 17:10:01,177 INFO     MainThread: Getting metric: 
> catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020     
>              
> -- 2025-07-31 17:10:01,180 INFO     MainThread: Waiting for metric value 
> 'catalog-server.active-status'=True. Current value: False. total_wait: 0s
> -- 2025-07-31 17:10:01,180 INFO     MainThread: Sleeping 1s before next 
> retry.                                                                        
>                                                                 
> -- 2025-07-31 17:10:02,181 INFO     MainThread: Getting metric: 
> catalog-server.active-status from quanlong-ha-test.vpc.cloudera.com:25020     
>                                                                         
> -- 2025-07-31 17:10:02,184 INFO     MainThread: Metric 
> 'catalog-server.active-status' has reached desired value: True. total_wait: 
> 1.00396800041s
> -- 2025-07-31 17:10:02,188 INFO     MainThread: hs2: executing against Impala 
> at quanlong-ha-test.vpc.cloudera.com:21050. session: 
> ea4e2fca5d5b35f3:c51bd83bdb2417a9 main_cursor: True user: None
>                                                                               
>                                                                               
>                                                           
> show partitions warmup_test_db.tbl;
>                                                                               
>                                                                               
>                                                           
> -- 2025-07-31 17:10:03,737 INFO     MainThread: 
> 664e4d97f4712dfb:e4d0877500000000: query started           
> -- 2025-07-31 17:10:03,738 INFO     MainThread: 
> 664e4d97f4712dfb:e4d0877500000000: getting log for operation
> -- 2025-07-31 17:10:03,738 INFO     MainThread: 
> 664e4d97f4712dfb:e4d0877500000000: getting runtime profile operation   
> -- 2025-07-31 17:10:03,738 INFO     MainThread: 
> 664e4d97f4712dfb:e4d0877500000000: closing query for operation      
> -- 2025-07-31 17:10:03,739 INFO     MainThread: partition result: 
> ['Total\t\t-1\t0\t0B\t0B\t\t\t\t\t']{noformat}
> Profile of ShowPartitions indicates some stale metadata are used:
> {noformat}
>       Original Table Versions: warmup_test_db.tbl, 314, 1754007002807, Thu 
> Jul 31 17:10:02 PDT 2025
>        - CatalogFetch.ColumnStats.Hits: 13  <--- This shouldn't hit
>        - CatalogFetch.ColumnStats.Misses: 0
>        - CatalogFetch.ColumnStats.Requests: 13
>        - CatalogFetch.ColumnStats.Time: 0
>        - CatalogFetch.DatabaseList.Hits: 1
>        - CatalogFetch.DatabaseList.Misses: 0
>        - CatalogFetch.DatabaseList.Requests: 1
>        - CatalogFetch.DatabaseList.Time: 0
>        - CatalogFetch.PartitionLists.Hits: 1  <--- This shouldn't hit
>        - CatalogFetch.PartitionLists.Misses: 0
>        - CatalogFetch.PartitionLists.Requests: 1
>        - CatalogFetch.PartitionLists.Time: 0
>        - CatalogFetch.RPCs.Bytes: 1.78 KB (1818)
>        - CatalogFetch.RPCs.Requests: 2
>        - CatalogFetch.RPCs.Time: 1s531ms
>        - CatalogFetch.StorageLoad.Time: 22ms
>        - CatalogFetch.TableList.Hits: 1
>        - CatalogFetch.TableList.Misses: 1
>        - CatalogFetch.TableList.Requests: 2
>        - CatalogFetch.TableList.Time: 1s516ms
>        - CatalogFetch.Tables.Hits: 0
>        - CatalogFetch.Tables.Misses: 1
>        - CatalogFetch.Tables.Requests: 1
>        - CatalogFetch.Tables.Time: 15ms
>        - ExecutorGroupsConsidered: 1 (1){noformat}
> In logs of the new active catalogd (catalogd.INFO), the catalog version of 
> the table is 314:
> {noformat}
> I20250731 17:10:02.760269 103771 CatalogOpExecutor.java:4872] Found 1/1 
> partitions to add in table warmup_test_db.tbl from event 888
> I20250731 17:10:02.807034 103771 HdfsTable.java:781] Loaded file and block 
> metadata for warmup_test_db.tbl partitions: year=2025/month=1. Time taken: 
> 27.856ms
> I20250731 17:10:02.807621 103771 MetastoreEvents.java:860] EventId: 888 
> EventType: ADD_PARTITION Target: warmup_test_db.tbl. Successfully added 1 
> partitions to table warmup_test_db.tbl
> I20250731 17:10:02.807871 103771 MetastoreEventsProcessor.java:1578] Time 
> elapsed in processing event batch: 1s156ms
> I20250731 17:10:03.673661 103802 JniUtil.java:167] getEventProcessorSummary 
> request: Getting event processor summary
> I20250731 17:10:03.674260 103802 JniUtil.java:178] Finished 
> getEventProcessorSummary request: Getting event processor summary. Time 
> spent: 1ms
> I20250731 17:10:03.674295 103802 catalog-server.cc:978] EventProcessor is 
> synced up with HMS event id during failover: 888
> I20250731 17:10:03.674309 103802 catalog-server.cc:887] This catalogd 
> instance is changed to active status
> I20250731 17:10:03.677537 103773 JniUtil.java:167] getCatalogDelta request: 
> Getting catalog delta from version 0
> I20250731 17:10:03.687700 103773 catalog-server.cc:1615] Collected update: 
> 2:DATABASE:default, version=2, original size=48, compressed size=49
> ...
> I20250731 17:10:03.720110 103773 CatalogServiceCatalog.java:1965] Skipped 0 
> partitions of table warmup_test_db.tbl in the incremental update
> I20250731 17:10:03.720285 103773 CatalogServiceCatalog.java:1978] Collected 1 
> partition update(s): 2:HDFS_PARTITION:warmup_test_db.tbl:year=2025/month=1, 
> version=314, original size=100, compressed size=101
> I20250731 17:10:03.720324 103773 catalog-server.cc:1615] Collected update: 
> 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed 
> size=66{noformat}
> In the previous catalogd logs (catalogd_node1.INFO), the table version before 
> AddPartition is also 314!
> {noformat}
> I20250731 17:09:58.553951 103647 HdfsTable.java:1256] Loading metadata for 
> table definition and all partition(s) of warmup_test_db.tbl (needed by 
> coordinator)
> I20250731 17:09:58.600651 103647 HdfsTable.java:1919] Loaded 13 columns from 
> HMS. Actual columns: 13
> I20250731 17:09:58.623989 103647 HdfsTable.java:2987] Load Valid Write Id 
> List Done. Time taken: 24.253us
> I20250731 17:09:58.624092 103647 HdfsTable.java:1303] Fetching partition 
> metadata from the Metastore: warmup_test_db.tbl
> I20250731 17:09:58.628580 103647 MetaStoreUtil.java:191] Fetching 0 
> partitions for: warmup_test_db.tbl using partition batch size: 1000
> I20250731 17:09:58.628685 103647 HdfsTable.java:1310] Fetched partition 
> metadata from the Metastore: warmup_test_db.tbl
> I20250731 17:09:58.649619 103647 HdfsTable.java:781] Loaded file and block 
> metadata for warmup_test_db.tbl partitions: . Time taken: 3.319ms
> I20250731 17:09:58.651077 103647 Table.java:1106] last refreshed event id for 
> table: warmup_test_db.tbl set to: -1
> I20250731 17:09:58.651185 103647 TableLoader.java:187] Loaded metadata for: 
> warmup_test_db.tbl (129ms)
> I20250731 17:09:58.685761 102954 JniUtil.java:167] getCatalogDelta request: 
> Getting catalog delta from version 313
> I20250731 17:09:58.693717 102954 CatalogServiceCatalog.java:1965] Skipped 0 
> partitions of table warmup_test_db.tbl in the incremental update
> I20250731 17:09:58.693792 102954 catalog-server.cc:1615] Collected update: 
> 2:TABLE:warmup_test_db.tbl, version=314, original size=65, compressed 
> size=66{noformat}
> So coordinator has loaded the empty partition list of it in cache. It's 
> associated with table name "warmup_test_db.tbl" and catalog version 314. Then 
> coincidently reused after catalogd warm failover.
> We should add the catalog service id in the getPartialCatalogObject response 
> and coordinator should clear the cache when the id changes, just like what we 
> do in updateCatalogCache() -> 
> [witnessCatalogServiceId()|https://github.com/apache/impala/blob/5414c3009890a736901b6ae3b56b9953b1992de7/fe/src/main/java/org/apache/impala/catalog/local/CatalogdMetaProvider.java#L1614].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to