[
https://issues.apache.org/jira/browse/IMPALA-14266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Quanlong Huang updated IMPALA-14266:
------------------------------------
Attachment: test_catalogd_manual_failover_with_failed_rpc_logs.tar.gz
> test_catalogd_manual_failover_with_failed_rpc is flaky in
> __verify_impalad_active_catalogd_port
> -----------------------------------------------------------------------------------------------
>
> Key: IMPALA-14266
> URL: https://issues.apache.org/jira/browse/IMPALA-14266
> Project: IMPALA
> Issue Type: Bug
> Components: Test
> Reporter: Quanlong Huang
> Priority: Major
> Attachments: test_catalogd_manual_failover_with_failed_rpc_logs.tar.gz
>
>
> Saw a failure in TestCatalogdHA.test_catalogd_manual_failover_with_failed_rpc
> on a private branch:
> {code:python}
> custom_cluster/test_catalogd_ha.py:393: in test_catalogd_manual_failover_with_failed_rpc
>     self.__test_catalogd_manual_failover(unique_database)
> custom_cluster/test_catalogd_ha.py:333: in __test_catalogd_manual_failover
>     self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)
> custom_cluster/test_catalogd_ha.py:84: in __verify_impalad_active_catalogd_port
>     assert int(catalog_service_port) == catalogd_service.get_catalog_service_port()
> E   assert 26000 == 26001
> E    +  where 26000 = int('26000')
> E    +  and   26001 = <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>>()
> E    +    where <bound method CatalogdService.get_catalog_service_port of <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>> = <tests.common.impala_service.CatalogdService object at 0x7fdd80eb2b90>.get_catalog_service_port{code}
> The test code assumes that after the new active catalogd updates its
> "active-status" metric, both statestore and all coordinators should have
> updated their "active-catalogd-address":
> {code:python}
> # Kill active catalogd
> active_catalogd.kill()
> # Wait for long enough for the statestore to detect the failure of active catalogd
> # and assign active role to standby catalogd.
> catalogd_service_2.wait_for_metric_value(
>     "catalog-server.active-status", expected_value=True, timeout=30)
> assert catalogd_service_2.get_metric_value(
>     "catalog-server.ha-number-active-status-change") > 0
> assert catalogd_service_2.get_metric_value("catalog-server.active-status")
> # Verify ports of the active catalogd of statestore and impalad are matching with
> # the catalog service port of the current active catalogd.
> self.__verify_statestore_active_catalogd_port(catalogd_service_2)
> self.__verify_impalad_active_catalogd_port(0, catalogd_service_2)  # <---- Failed here
> self.__verify_impalad_active_catalogd_port(1, catalogd_service_2)
> self.__verify_impalad_active_catalogd_port(2, catalogd_service_2){code}
> I don't think that assumption holds: the fact that one subscriber (the new
> active catalogd) has processed the update doesn't guarantee that all other
> subscribers have processed it too. The test should instead wait, with a
> timeout, for the statestore and coordinator metrics to be updated.
> CC [~wzhou], [~rizaon]
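
A minimal sketch of the "wait with a timeout" idea, not the actual fix: the helper name wait_for_value and the FakeImpalad stand-in are hypothetical and are not part of test_catalogd_ha.py or tests.common.impala_service. In the real test, the getter would presumably read the coordinator's active catalogd address metric; here a scripted fake keeps the example self-contained.

```python
import time


def wait_for_value(getter, expected, timeout_s=30.0, interval_s=0.5):
    """Poll getter() until it returns expected or timeout_s elapses.

    Returns the last observed value, so the caller can assert on it and
    still get a useful failure message if the wait timed out.
    """
    deadline = time.monotonic() + timeout_s
    actual = getter()
    while actual != expected and time.monotonic() < deadline:
        time.sleep(interval_s)
        actual = getter()
    return actual


class FakeImpalad(object):
    """Hypothetical stand-in for a coordinator whose view of the active
    catalogd lags behind the new active catalogd's own metric."""

    def __init__(self, ports):
        self._ports = list(ports)

    def active_catalogd_port(self):
        # Replay the scripted ports; keep returning the last one once the
        # sequence is exhausted.
        if len(self._ports) > 1:
            return self._ports.pop(0)
        return self._ports[0]


# The coordinator still reports the old port (26000) on the first two reads
# and only then switches to the new active catalogd (26001).
impalad = FakeImpalad([26000, 26000, 26001])
observed = wait_for_value(impalad.active_catalogd_port, 26001,
                          timeout_s=5.0, interval_s=0.01)
assert observed == 26001
```

Returning the last observed value rather than a boolean keeps the assertion message as informative as the current one (for example "assert 26000 == 26001") when the wait does time out.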