Quanlong Huang created IMPALA-14266:
---------------------------------------

             Summary: test_catalogd_manual_failover_with_failed_rpc is flaky in __verify_impalad_active_catalogd_port
                 Key: IMPALA-14266
                 URL: https://issues.apache.org/jira/browse/IMPALA-14266
             Project: IMPALA
          Issue Type: Bug
          Components: Test
            Reporter: Quanlong Huang


Saw a failure in TestCatalogdHA.test_catalogd_manual_failover_with_failed_rpc on a private branch:
{code:python}
custom_cluster/test_catalogd_ha.py:393: in test_catalogd_manual_failover_with_failed_rpc
    self.__test_catalogd_manual_failover(unique_database)
custom_cluster/test_catalogd_ha.py:333: in __test_catalogd_manual_failover
    self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)
custom_cluster/test_catalogd_ha.py:84: in __verify_impalad_active_catalogd_port
    assert int(catalog_service_port) == catalogd_service.get_catalog_service_port()
E   assert 26000 == 26001
E    +  where 26000 = int('26000')
E    +  and   26001 = <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>>()
E    +    where <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>> = <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>.get_catalog_service_port
{code}
The test code assumes that after the new active catalogd updates its "active-status" metric, both statestore and all coordinators should have updated their "active-catalogd-address":
{code:python}
    # Kill active catalogd
    active_catalogd.kill()

    # Wait for long enough for the statestore to detect the failure of active catalogd
    # and assign active role to standby catalogd.
    catalogd_service_2.wait_for_metric_value(
        "catalog-server.active-status", expected_value=True, timeout=30)
    assert catalogd_service_2.get_metric_value(
        "catalog-server.ha-number-active-status-change") > 0
    assert catalogd_service_2.get_metric_value("catalog-server.active-status")

    # Verify ports of the active catalogd of statestore and impalad are matching with
    # the catalog service port of the current active catalogd.
    self.__verify_statestore_active_catalogd_port(catalogd_service_2)
    self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)  # <---- Failed here
    self.__verify_impalad_active_catalogd_port(1, catalogd_service_2)
    self.__verify_impalad_active_catalogd_port(2, catalogd_service_2)
{code}
I think that assumption doesn't hold: the fact that one subscriber (the new active catalogd) has processed the update doesn't guarantee that all other subscribers have processed it too. We should wait, with a timeout, for their metrics to be updated.

CC [~wzhou], [~rizaon]
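
A rough sketch of the "wait with a timeout" idea, outside the Impala test framework. The names here (wait_for_value, FakeSubscriber, get_active_catalogd_port) are hypothetical and only illustrate the polling pattern; the real fix would presumably use the existing helpers in tests/common/impala_service.py instead. The fake subscriber keeps reporting the stale port for a few polls, the way a coordinator that hasn't processed the statestore update yet would.

```python
import time


def wait_for_value(get_value, expected, timeout_s=30.0, interval_s=0.5):
    """Poll get_value() until it returns `expected` or the timeout elapses.

    Returns the last observed value, so the caller can assert on it and
    still get a meaningful failure message if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    value = get_value()
    while value != expected and time.monotonic() < deadline:
        time.sleep(interval_s)
        value = get_value()
    return value


class FakeSubscriber:
    """Hypothetical stand-in for an impalad whose metric lags behind."""

    def __init__(self, stale_port, new_port, polls_until_updated):
        self._stale_port = stale_port
        self._new_port = new_port
        self._polls_left = polls_until_updated

    def get_active_catalogd_port(self):
        # Report the stale port for the first few polls, then the new one.
        if self._polls_left > 0:
            self._polls_left -= 1
            return self._stale_port
        return self._new_port


if __name__ == "__main__":
    impalad = FakeSubscriber(stale_port=26000, new_port=26001,
                             polls_until_updated=3)
    observed = wait_for_value(impalad.get_active_catalogd_port,
                              expected=26001, timeout_s=5.0, interval_s=0.01)
    # An immediate assert would have seen 26000 here; polling tolerates the lag.
    assert observed == 26001
```

Returning the last observed value instead of a boolean keeps the assertion in the caller, so a timeout still fails with the stale and expected ports in the message.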