[jira] [Updated] (IMPALA-14266) test_catalogd_manual_failover_with_failed_rpc is flaky in __verify_impalad_active_catalogd_port

Quanlong Huang (Jira) Tue, 22 Jul 2025 18:57:59 -0700


     [ 
https://issues.apache.org/jira/browse/IMPALA-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Quanlong Huang updated IMPALA-14266:
------------------------------------
    Attachment: test_catalogd_manual_failover_with_failed_rpc_logs.tar.gz

> test_catalogd_manual_failover_with_failed_rpc is flaky in 
> __verify_impalad_active_catalogd_port
> -----------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-14266
>                 URL: https://issues.apache.org/jira/browse/IMPALA-14266
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Test
>            Reporter: Quanlong Huang
>            Priority: Major
>         Attachments: test_catalogd_manual_failover_with_failed_rpc_logs.tar.gz
>
>
> Saw a failure in TestCatalogdHA.test_catalogd_manual_failover_with_failed_rpc 
> on a private branch:
> {code:python}
> custom_cluster/test_catalogd_ha.py:393: in test_catalogd_manual_failover_with_failed_rpc
>     self.__test_catalogd_manual_failover(unique_database)
> custom_cluster/test_catalogd_ha.py:333: in __test_catalogd_manual_failover
>     self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)
> custom_cluster/test_catalogd_ha.py:84: in __verify_impalad_active_catalogd_port
>     assert int(catalog_service_port) == catalogd_service.get_catalog_service_port()
> E   assert 26000 == 26001
> E    +  where 26000 = int('26000')
> E    +  and   26001 = <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>>()
> E    +    where <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>> = <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>.get_catalog_service_port{code}
> The test code assumes that once the new active catalogd has updated its 
> "active-status" metric, the statestore and all coordinators have also 
> updated their "active-catalogd-address":
> {code:python}
>     # Kill active catalogd
>     active_catalogd.kill()
>     # Wait for long enough for the statestore to detect the failure of active catalogd
>     # and assign active role to standby catalogd.
>     catalogd_service_2.wait_for_metric_value(
>         "catalog-server.active-status", expected_value=True, timeout=30)
>     assert catalogd_service_2.get_metric_value(
>         "catalog-server.ha-number-active-status-change") > 0
>     assert catalogd_service_2.get_metric_value("catalog-server.active-status")
>     # Verify ports of the active catalogd of statestore and impalad are matching with
>     # the catalog service port of the current active catalogd.
>     self.__verify_statestore_active_catalogd_port(catalogd_service_2)
>     self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)  # <---- Failed here
>     self.__verify_impalad_active_catalogd_port(1, catalogd_service_2)
>     self.__verify_impalad_active_catalogd_port(2, catalogd_service_2){code}
> I think that assumption is wrong: the fact that one subscriber (the new 
> active catalogd) has processed the update doesn't guarantee that all other 
> subscribers have processed it too. We should wait for their metrics to be 
> updated, with a timeout.
> CC [~wzhou], [~rizaon] 
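
The proposed fix (poll each subscriber until it reports the new active catalogd, instead of asserting once) could be sketched roughly as below. This is only an illustration of the polling idea, not the actual patch: wait_for_value and FakeSubscriber are hypothetical names invented here and are not part of tests.common.impala_service. How the fetch callable would read a coordinator's active catalogd port in the real test is left open.

```python
import time


def wait_for_value(fetch, expected, timeout_s=30.0, interval_s=0.5):
    """Poll fetch() until it returns `expected` or `timeout_s` elapses.

    Hypothetical helper: `fetch` is any zero-argument callable, for example
    one that reads a subscriber's view of the active catalogd port.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        actual = fetch()
        if actual == expected:
            return actual
        if time.monotonic() >= deadline:
            raise AssertionError(
                "timed out after %.1fs: expected %r, last saw %r"
                % (timeout_s, expected, actual))
        time.sleep(interval_s)


class FakeSubscriber(object):
    """Stand-in for a coordinator that applies the update a few polls late."""

    def __init__(self, old_port, new_port, polls_until_updated):
        self.old_port = old_port
        self.new_port = new_port
        self.polls_left = polls_until_updated
        self.polls = 0

    def active_catalogd_port(self):
        # Return the stale port until the simulated update "arrives".
        self.polls += 1
        if self.polls_left > 0:
            self.polls_left -= 1
            return self.old_port
        return self.new_port
```

With a helper like this, a single immediate assert on each impalad would become a bounded wait, so a coordinator that processes the statestore update slightly later than the new active catalogd no longer fails the test, while a coordinator that never updates still fails after the timeout.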



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
