Wenzhe Zhou has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/20657 )

Change subject: IMPALA-12525: Fix flaky test test_statestored_manual_failover
......................................................................

IMPALA-12525: Fix flaky test test_statestored_manual_failover

In test_statestored_manual_failover, statestore service failover is
sometimes not triggered when the network of the active statestored is
disabled after a manually forced failover.
During the test, the network of the active statestored could be
disabled before all subscribers had re-registered with the restarted
statestored. As a result, some subscribers did not receive the
notification of the active statestored change, so they could not
correctly report connection states for requests from the standby
statestored.

This patch makes the following changes:
1) Updated the test case test_statestored_manual_failover to disable
   the network of the active statestored only after all subscribers
   have re-registered with the restarted statestored.
2) Defined a new mutex active_lock_ in class StatestoreStub to protect
   is_active_, since the mutex lock_ could be held for a long time if
   the subscriber loses the connection with statestored and enters
   recovery mode.
3) Found one case that was not handled on statestore subscribers. The
   subscribers could be started before both statestore instances are
   ready to accept registration requests. This caused impalad to hit a
   DCHECK. Changed the code to handle this case in this patch. Added
   test cases that inject a real delay in statestored startup and
   verify that impalads and catalogd are able to tolerate this delay.
4) Updated the address of the active catalogd in the metrics of
   statestored after statestore service failover.
5) Another test, test_statestored_auto_failover_with_disabling_network,
   failed occasionally due to a delay of the HA Handshake RPC between
   the two statestore instances. The issue is tracked with
   IMPALA-12550. The last two lines of the test are commented out
   temporarily.

Testing:
 - Repeatedly ran test_statestored_manual_failover on Jenkins hundreds
   of times.
 - Repeatedly ran test_statestored_manual_failover on a local machine
   a thousand times without failure.
 - Passed core tests

Change-Id: If03bf09d22a2875d2c1eec8a4f62eeefc5d855dc
---
M be/src/statestore/statestore-subscriber.cc
M be/src/statestore/statestore-subscriber.h
M be/src/statestore/statestore.cc
M tests/custom_cluster/test_statestored_ha.py
4 files changed, 150 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/57/20657/7
--
To view, visit http://gerrit.cloudera.org:8080/20657
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If03bf09d22a2875d2c1eec8a4f62eeefc5d855dc
Gerrit-Change-Number: 20657
Gerrit-PatchSet: 7
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
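
Illustration for change 2) in the commit message above: a minimal sketch of why a small flag can be given its own mutex instead of sharing a coarse lock that may be held for a long time. The class StubSketch and its methods are hypothetical and are not taken from the patch; only the names lock_, active_lock_ and is_active_ come from the commit message, and the actual StatestoreStub code may look quite different.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the stub described in the commit message.
class StubSketch {
 public:
  // Simulates recovery mode: holds the coarse lock_ for the whole duration.
  void RunRecovery(std::chrono::milliseconds duration) {
    std::lock_guard<std::mutex> l(lock_);
    std::this_thread::sleep_for(duration);
  }

  // Reads is_active_ under active_lock_ only, so a caller never waits
  // behind a thread that is holding lock_ in RunRecovery().
  bool IsActive() {
    std::lock_guard<std::mutex> l(active_lock_);
    return is_active_;
  }

  // Updates is_active_ under the same dedicated lock.
  void SetActive(bool active) {
    std::lock_guard<std::mutex> l(active_lock_);
    is_active_ = active;
  }

 private:
  std::mutex lock_;         // Coarse lock; may be held for a long time.
  std::mutex active_lock_;  // Protects is_active_ only.
  bool is_active_ = false;
};
```

With this layout, a thread answering a connection-state request can call IsActive() while another thread is inside RunRecovery(), because the two paths never contend on the same mutex.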
