[jira] [Created] (IMPALA-14266) test_catalogd_manual_failover_with_failed_rpc is flaky in __verify_impalad_active_catalogd_port

Quanlong Huang (Jira) Tue, 22 Jul 2025 18:56:30 -0700

Quanlong Huang created IMPALA-14266:
---------------------------------------


             Summary: test_catalogd_manual_failover_with_failed_rpc is flaky in __verify_impalad_active_catalogd_port
                 Key: IMPALA-14266
                 URL: https://issues.apache.org/jira/browse/IMPALA-14266
             Project: IMPALA
          Issue Type: Bug
          Components: Test
            Reporter: Quanlong Huang


Saw a failure in TestCatalogdHA.test_catalogd_manual_failover_with_failed_rpc 
on a private branch:
{code:python}
custom_cluster/test_catalogd_ha.py:393: in test_catalogd_manual_failover_with_failed_rpc
    self.__test_catalogd_manual_failover(unique_database)
custom_cluster/test_catalogd_ha.py:333: in __test_catalogd_manual_failover
    self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)
custom_cluster/test_catalogd_ha.py:84: in __verify_impalad_active_catalogd_port
    assert int(catalog_service_port) == catalogd_service.get_catalog_service_port()
E   assert 26000 == 26001
E    +  where 26000 = int('26000')
E    +  and   26001 = <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>>()
E    +    where <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>> = <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>.get_catalog_service_port
{code}
The test code assumes that once the new active catalogd has updated its 
"active-status" metric, the statestore and all coordinators have also 
updated their "active-catalogd-address":
{code:python}
    # Kill active catalogd
    active_catalogd.kill()

    # Wait for long enough for the statestore to detect the failure of active catalogd
    # and assign active role to standby catalogd.
    catalogd_service_2.wait_for_metric_value(
        "catalog-server.active-status", expected_value=True, timeout=30)
    assert catalogd_service_2.get_metric_value(
        "catalog-server.ha-number-active-status-change") > 0
    assert catalogd_service_2.get_metric_value("catalog-server.active-status")

    # Verify ports of the active catalogd of statestore and impalad are matching with
    # the catalog service port of the current active catalogd.
    self.__verify_statestore_active_catalogd_port(catalogd_service_2)
    self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)  # <---- Failed here
    self.__verify_impalad_active_catalogd_port(1, catalogd_service_2)
    self.__verify_impalad_active_catalogd_port(2, catalogd_service_2)
{code}
I don't think that assumption holds: the fact that one subscriber (the new 
active catalogd) has processed the update does not guarantee that all other 
subscribers have processed it as well. Instead of asserting immediately, we 
should wait, with a timeout, for their metrics to be updated.
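
A minimal sketch of the "wait with a timeout" idea, for illustration only. 
It does not use Impala's test helpers: wait_until and FakeCoordinator are 
hypothetical names, and FakeCoordinator merely simulates a coordinator that 
keeps reporting the old catalogd port (26000) for a few polls before it 
switches to the new one (26001). The actual fix would need to read the 
coordinator's real metric instead.

```python
import time


def wait_until(predicate, timeout_s=30.0, interval_s=0.1):
    """Poll predicate() until it returns True or timeout_s elapses.

    Returns True if the predicate became true in time, False otherwise.
    Hypothetical helper, not part of the Impala test framework.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


class FakeCoordinator:
    """Hypothetical stand-in for an impalad that learns the new port late."""

    def __init__(self, old_port, new_port, updates_after_polls):
        self._old_port = old_port
        self._new_port = new_port
        self._polls_left = updates_after_polls

    def active_catalogd_port(self):
        # Report the stale port until the simulated update has "arrived".
        if self._polls_left > 0:
            self._polls_left -= 1
            return self._old_port
        return self._new_port


coordinator = FakeCoordinator(26000, 26001, updates_after_polls=3)

# Asserting on the first read would be racy; polling tolerates the delay.
assert wait_until(lambda: coordinator.active_catalogd_port() == 26001,
                  timeout_s=5.0, interval_s=0.01)
```

Polling against a deadline keeps the check strict (it still fails if a 
subscriber never receives the update) while tolerating the ordering between 
subscribers, which is not guaranteed.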

CC [~wzhou], [~rizaon] 



