Michael Smith has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/20372 )

Change subject: IMPALA-12156: Support High Availability for Statestore
......................................................................


Patch Set 11:

(14 comments)

I've made it about halfway, posting some questions.

http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@31
PS11, Line 31:   Standby statestored saves the connection state as failure 
detecer.
nit: detector


http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@38
PS11, Line 38:   active staetstored.
nit: statestored


http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@48
PS11, Line 48: - Subscribers switch to new active statstore when receiving RPC
nit: statestore


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h
File be/src/statestore/statestore.h:

http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@137
PS11, Line 137: /// the subscribers. Active statetsored also sends heartbeats 
to standby statestored.
nit: statestored


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@140
PS11, Line 140: /// heartbeats from active statestored.
We could have a failure scenario where active statestored is still sending 
heartbeats to standby, but majority of subscribers are unable to reach the 
active statestored (and only standby). That seems like an acceptable exception 
where manual intervention would be needed.


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@782
PS11, Line 782:   int64_t active_version_ = 0;
Why is this based on time rather than an incrementing counter? It looks like 
it's just used for causal ordering.

It would help to describe the purpose of this variable.


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc
File be/src/statestore/statestore.cc:

http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@654
PS11, Line 654:   // Set least significant 64 bits of statestore_id as the 
number of microseconds that
Doc string for FLAGS_use_network_address_as_statestored_priority says if false 
it'll use a random value, but this isn't really random.

In most cases whichever statestore was started last will have priority. It's 
not clear to me whether that will cause the newer statestore to claim primary. 
Does the negotiation only happen if the older instance is a standby (i.e. both 
startup in a similar time window, and not yet timeout on the initial handshake)?


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1248
PS11, Line 1248:         ++failed_conn_state_count_;
Can we add a DCHECK for what we expect the old state to be?


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1250
PS11, Line 1250:         --failed_conn_state_count_;
Can we add a DCHECK for what we expect the new state to be?


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1251
PS11, Line 1251:       }
What are the other options here? Let's enforce them via DCHECK.


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1573
PS11, Line 1573:       update_statestored_cv_.WaitFor(l, timeout_us);
This combination means that we'll only be able to process other HA updates when 
the timeout occurs, right?


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1654
PS11, Line 1654:           if (status.code() == TErrorCode::RPC_RECV_TIMEOUT) {
nit: This and conditional line above could be combined to reduce indentation.


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1792
PS11, Line 1792:     // statestored designates itself as active if the 
statestore can not connect to the
What happens if both instances do this?


http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1963
PS11, Line 1963:         // TODO wzhou How to handle split-brain in network?
It looks like you outlined a plan here but haven't yet implemented it?



--
To view, visit http://gerrit.cloudera.org:8080/20372
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibd2c814bbad5c04c1d50c2edaa5b910c82a6fd87
Gerrit-Change-Number: 20372
Gerrit-PatchSet: 11
Gerrit-Owner: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Abhishek Rawat <[email protected]>
Gerrit-Reviewer: Andrew Sherman <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Michael Smith <[email protected]>
Gerrit-Reviewer: Wenzhe Zhou <[email protected]>
Gerrit-Reviewer: Yida Wu <[email protected]>
Gerrit-Comment-Date: Thu, 21 Sep 2023 23:53:34 +0000
Gerrit-HasComments: Yes

Reply via email to