Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/20372 )
Change subject: IMPALA-12156: Support High Availability for Statestore ...................................................................... Patch Set 11: (14 comments) I've made it about halfway, posting some questions. http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@31 PS11, Line 31: Standby statestored saves the connection state as failure detecer. nit: detector http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@38 PS11, Line 38: active staetstored. nit: statestored http://gerrit.cloudera.org:8080/#/c/20372/11//COMMIT_MSG@48 PS11, Line 48: - Subscribers switch to new active statstore when receiving RPC nit: statestore http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h File be/src/statestore/statestore.h: http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@137 PS11, Line 137: /// the subscribers. Active statetsored also sends heartbeats to standby statestored. nit: statestored http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@140 PS11, Line 140: /// heartbeats from active statestored. We could have a failure scenario where active statestored is still sending heartbeats to standby, but majority of subscribers are unable to reach the active statestored (and only standby). That seems like an acceptable exception where manual intervention would be needed. http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.h@782 PS11, Line 782: int64_t active_version_ = 0; Why is this based on time rather than an incrementing counter? It looks like it's just used for causal ordering. It would help to describe the purpose of this variable. http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc File be/src/statestore/statestore.cc: http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@654 PS11, Line 654: // Set least significant 64 bits of statestore_id as the number of microseconds that Doc string for FLAGS_use_network_address_as_statestored_priority says if false it'll use a random value, but this isn't really random. In most cases whichever statestore was started last will have priority. It's not clear to me whether that will cause the newer statestore to claim primary. Does the negotiation only happen if the older instance is a standby (i.e. both startup in a similar time window, and not yet timeout on the initial handshake)? http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1248 PS11, Line 1248: ++failed_conn_state_count_; Can we add a DCHECK for what we expect the old state to be? http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1250 PS11, Line 1250: --failed_conn_state_count_; Can we add a DCHECK for what we expect the new state to be? http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1251 PS11, Line 1251: } What are the other options here? Let's enforce them via DCHECK. http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1573 PS11, Line 1573: update_statestored_cv_.WaitFor(l, timeout_us); This combination means that we'll only be able to process other HA updates when the timeout occurs, right? http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1654 PS11, Line 1654: if (status.code() == TErrorCode::RPC_RECV_TIMEOUT) { nit: This and conditional line above could be combined to reduce indentation. http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1792 PS11, Line 1792: // statestored designates itself as active if the statestore can not connect to the What happens if both instances do this? http://gerrit.cloudera.org:8080/#/c/20372/11/be/src/statestore/statestore.cc@1963 PS11, Line 1963: // TODO wzhou How to handle split-brain in network? It looks like you outlined a plan here but haven't yet implemented it? -- To view, visit http://gerrit.cloudera.org:8080/20372 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: Ibd2c814bbad5c04c1d50c2edaa5b910c82a6fd87 Gerrit-Change-Number: 20372 Gerrit-PatchSet: 11 Gerrit-Owner: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Abhishek Rawat <[email protected]> Gerrit-Reviewer: Andrew Sherman <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Wenzhe Zhou <[email protected]> Gerrit-Reviewer: Yida Wu <[email protected]> Gerrit-Comment-Date: Thu, 21 Sep 2023 23:53:34 +0000 Gerrit-HasComments: Yes
