[jira] [Commented] (IMPALA-12550) test_statestored_auto_failover_with_disabling_network flaky

ASF subversion and git services (Jira) Thu, 09 Nov 2023 17:10:06 -0800


    [ 
https://issues.apache.org/jira/browse/IMPALA-12550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784641#comment-17784641
 ]


ASF subversion and git services commented on IMPALA-12550:
----------------------------------------------------------

Commit 7403d10a55397f81784f369aa501b9c072d198b6 in impala's branch 
refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=7403d10a5 ]

IMPALA-12550: Fix flaky test 
test_statestored_auto_failover_with_disabling_network

Test test_statestored_auto_failover_with_disabling_network failed
occasionally due to delays in the HA handshake or HA heartbeat RPCs between
the two statestore instances. Sometimes the active statestore took a few
minutes to respond to the handshake requests from the standby statestore.

This patch fixes the issue by not holding the mutex ha_lock_ while sending
HA handshake and HA heartbeat RPCs. Redundant HA heartbeats are handled
on the receiver side. Redundant HA handshakes are harmless.
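
The locking change described above can be illustrated with a minimal
sketch. This is not Impala's code (the statestore is C++): the class
names, the send_rpc callable, and the sequence-number check used to drop
redundant heartbeats are all assumptions made for illustration. Only the
idea of releasing ha_lock_ before the RPC is sent comes from the commit
message.

```python
import threading


class HaSender:
    """Hypothetical sender; names are illustrative, not Impala's."""

    def __init__(self, send_rpc):
        self._ha_lock = threading.Lock()
        self._send_rpc = send_rpc  # assumed callable; may block for a long time
        self._seq = 0

    def send_heartbeat(self):
        # Hold the lock only long enough to snapshot the state to send.
        with self._ha_lock:
            self._seq += 1
            request = {"seq": self._seq}
        # The lock is released here, so a slow RPC cannot stall other
        # threads that need _ha_lock (for example, a handshake handler).
        self._send_rpc(request)
        return request


class HaReceiver:
    """Hypothetical receiver that ignores redundant heartbeats."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_seq = 0

    def on_heartbeat(self, request):
        # Assumed de-duplication scheme: accept only increasing sequence numbers.
        with self._lock:
            if request["seq"] <= self._last_seq:
                return False  # redundant or stale heartbeat
            self._last_seq = request["seq"]
            return True
```

Because the send happens outside the lock, two threads may occasionally
send overlapping heartbeats, which is why the receiver in this sketch has
to tolerate duplicates.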

Testing:
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on Jenkins hundreds of times without failure.
 - Repeatedly ran test_statestored_auto_failover_with_disabling_network
   on a local machine a thousand times without failure.
 - Repeatedly ran all tests in test_statestored_ha.py for over 12 hours
   on Jenkins without failure.
 - Passed core tests.

Change-Id: I515bbaaddfb4bf9bd2a39414cd6e3e4590dfbfb1
Reviewed-on: http://gerrit.cloudera.org:8080/20689
Reviewed-by: Riza Suminto <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> test_statestored_auto_failover_with_disabling_network flaky
> -----------------------------------------------------------
>
>                 Key: IMPALA-12550
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12550
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 4.4.0
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> TestStatestoredHA.test_statestored_auto_failover_with_disabling_network 
> failed with the following stack trace when the test was run repeatedly:
> tests/custom_cluster/test_statestored_ha.py:645: in 
> test_statestored_auto_failover_with_disabling_network
>     "statestore.in-ha-recovery-mode", expected_value=False, timeout=120)
> tests/common/impala_service.py:144: in wait_for_metric_value
>     self.__metric_timeout_assert(metric_name, expected_value, timeout)
> tests/common/impala_service.py:213: in __metric_timeout_assert
>     assert 0, assert_string
> E   AssertionError: Metric statestore.in-ha-recovery-mode did not reach value 
> False in 120s.
> From the log messages, the issue was caused by a delay in the HA handshake RPC 
> between the two statestore instances. Sometimes the active statestore took a few 
> minutes to respond to the handshake requests from the standby statestore.
> This issue is different from IMPALA-12525, which was caused by a locking issue on 
> the subscriber side.
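
The assertion in the stack trace above comes from a helper that polls a
metric until it reaches an expected value or a timeout expires. A minimal
sketch of that polling pattern follows. It is not the implementation in
tests/common/impala_service.py: the function name wait_for_value, the
get_value callable, and the interval parameter are assumptions, and only
the shape of the error message is taken from the trace.

```python
import time


def wait_for_value(get_value, metric_name, expected_value, timeout, interval=1.0):
    """Poll get_value() until it returns expected_value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        value = get_value()
        if value == expected_value:
            return value
        if time.monotonic() >= deadline:
            # Same message shape as the AssertionError in the stack trace.
            raise AssertionError(
                "Metric %s did not reach value %s in %ss."
                % (metric_name, expected_value, timeout))
        time.sleep(interval)
```

With a check like this, a handshake that is delayed for longer than the
timeout surfaces as a test failure even if the statestore would have
recovered eventually, which matches the flakiness reported here.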



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
