[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

Riza Suminto (Code Review) Fri, 14 Jun 2024 11:17:13 -0700

Riza Suminto has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/21520 )


Change subject: IMPALA-13159: Fix query cancellation caused by statestore 
failover
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/21520/1//COMMIT_MSG@12
PS1, Line 12: This patch adds code to handle inconsistent
            : cluster membership state after statestore failover in the same 
way.
Please clarify what is the expected behavior. When is query should cancel and 
not cancel?
Is both failover scenario and regular StateStore restart scenario should not 
cancel RUNNING query?


http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc
File be/src/statestore/statestore-subscriber.cc:

http://gerrit.cloudera.org:8080/#/c/21520/1/be/src/statestore/statestore-subscriber.cc@1095
PS1, Line 1095: bool has_failed_before = connection_failure_metric_->GetValue() 
> 0;
              :   bool in_failed_grace_period = 
MilliSecondsSinceLastRegistration()
              :       < FLAGS_statestore_subscriber_recovery_grace_period_ms;
I think 'has_disconnect_before' and 'in_disconnect_grace_period' is a better 
name to distinguish from failover scenario.

My understanding is that, disconnect only means temporary network issue to 
StateStore, and is not changing from active StateStore to standby StateStore.


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py
File tests/custom_cluster/test_statestored_ha.py:

http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@677
PS1, Line 677: # Wait for long enough for the standby statestored to detect the 
failure of active
             :       # statestored and assign itself in active role.
Is this 2s (statestore_peer_timeout_seconds)?


http://gerrit.cloudera.org:8080/#/c/21520/1/tests/custom_cluster/test_statestored_ha.py@688
PS1, Line 688: # Now kill a backend, and make sure the query fails.
Why not let query finish?
I thought this test want to show that query completes and not impacted by 
StateStore failover.



--
To view, visit http://gerrit.cloudera.org:8080/21520
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I720bec5199df46475b954558abb0637ca7e6298b
Gerrit-Change-Number: 21520
Gerrit-PatchSet: 1
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Riza Suminto <[email protected]>
Gerrit-Comment-Date: Fri, 14 Jun 2024 18:12:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-13159: Fix query cancellation caused by statestore failover

Reply via email to