Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16936 )
Change subject: [master] KUDU-2181 Raft ChangeConfig request to remove a master ...................................................................... Patch Set 1: (24 comments) http://gerrit.cloudera.org:8080/#/c/16936/1//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/16936/1//COMMIT_MSG@12 PS1, Line 12: tests > I'm curious what's the behavior of Kudu client when a master is removed und That's a good point and I guess applies to both addition and removal of masters but more so to removal of masters. Let me address and looking into it as a follow-up change. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc File src/kudu/master/dynamic_multi_master-test.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@254 PS1, Line 254: cluster > nit: it would be great to document the 'cluster' parameter as well in the c Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@346 PS1, Line 346: if (master != HostPort()) { : *req.mutable_rpc_addr() = HostPortToPB(master); : } > What does it mean for this to be called with an unset 'master' field? Would Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@474 PS1, Line 474: TransferMasterLeadershipAndVerify > nit: "verify" is a bit overloaded, especially when in a file that can misco Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@787 PS1, Line 787: // When an ExternalMiniCluster is restarted after removal of a master then one of the : // remaining masters can get reassigned to the same wal dir which was previously assigned : // to the removed master. This causes problems during verification, so we always try to : // remove the last master in terms of index for test purposes. > Does this problem go away if, rather than creating a new mini cluster, we j ExternalMiniCluster interface doesn't allow updating the masters or more specifically the ExternalMiniClusterOptions on restart. Hence currently that's not an option. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@805 PS1, Line 805: // We need at least 1 operation for the leader in current term before initiating ChangeConfig. > I thought this happens by replicating a no-op when first becoming leader. O I did hit following error during development of the test and now that I recall it was intermittent. So there is a possibility of a race when the no-op doesn't finish before initiating the ChangeConfig request. "Leader has not yet committed an operation in its own term" https://github.com/apache/kudu/blob/master/src/kudu/consensus/raft_consensus.cc#L1896 I'm not sure about a good way to wait/verify that the NO-OP operation has completed and it's safe to proceed with Remove master ChangeConfig. Table creation helps with verification later so keeping this code as-is and added a comment. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@814 PS1, Line 814: ASSERT_EVENTUALLY([&] { : VerifyDeadMasterInSpecifiedState(master_to_remove_uuid, consensus::HealthReportPB::UNKNOWN); : }); > It seems possible we'd trigger a test failure if, by luck, the gaps in ASSE Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@834 PS1, Line 834: ASSERT_FALSE(ContainsKey(actual_master_hps, master_to_remove)); > nit: while I appreciate the dedicated checking, especially when it comes to Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@853 PS1, Line 853: cluster_.reset(); > Does ExternalMiniCluster interface doesn't allow updating the masters or more specifically the ExternalMiniClusterOptions on restart. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@930 PS1, Line 930: /* Omitting "--master_support_change_config" */ > nit: maybe just set it to false now so we don't have to update this test wh Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@938 PS1, Line 938: non_leader_master_idx > Does it matter whether it's leader or non-leader master index? I guess if Good point. Added checks for both leader and non-leader masters. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@995 PS1, Line 995: It's possible there is a leadership change between querying for leader master and : // sending the add master request to non-leader master and hence using ASSERT_EVENTUALLY. > In tablet server tests there is another way to protect against this -- disa Using higher value of --leader_failure_max_missed_heartbeat_periods to avoid leader master transfers. I tried --enable_leader_failure_detection=false for masters but leads to issues in electing a master on cluster startup. I didn't dive deeper. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@1009 PS1, Line 1009: ASSERT_OK(cluster_->master_proxy(non_leader_master_idx)->RemoveMaster(req, &resp, &rpc)); > Yep, I'm curious whether this might end up in removing one master, and then Scenario mentioned by Alexey would be possible. Using higher value of --leader_failure_max_missed_heartbeat_periods to avoid leader master transfers. I tried --enable_leader_failure_detection=false for masters but leads to issues in electing a master on cluster startup. I didn't dive deeper. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@1019 PS1, Line 1019: TEST_F(DynamicMultiMasterTest, TestRemoveLeaderMaster) { > Does it make sense to add a simple scenario to verify that a single and the Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.h File src/kudu/master/master.h: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.h@131 PS1, Line 131: of > nit: of initiating ? Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc File src/kudu/master/master.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@523 PS1, Line 523: if (HostPortFromPB(peer.last_known_addr()) == hp > Could there be more than one master satisfying this criterion? If so, what Updated the code to catch that condition and return an error back. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@532 PS1, Line 532: Substitute > nit: nothing being substituted? Maybe add the leader's UUID? Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@532 PS1, Line 532: leader master > nit: I guess it might be not only the leader, right? Maybe, just report th Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.proto File src/kudu/master/master.proto: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.proto@941 PS1, Line 941: HostPortPB > Why HostPostPB instead of UUID as an identifier of existing server is bette For adding a master makes sense to use HostPortPB and removing better to use the unique id. Changed to using uuid instead. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc File src/kudu/master/master_service.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc@277 PS1, Line 277: if (!FLAGS_master_support_change_config) { : rpc->RespondFailure(Status::NotSupported("Removing master is not supported")); : return; : } > Hrm, can you remind me why it's insufficient to rely on the SupportsFeature Pasting the comment added. // This feature flag is part of SupportsFeature() function in master service but it's possible // that a client may not specify the feature flag. // This check protects access to the server feature irrespective of whether the // feature flag is specified in the client. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h File src/kudu/mini-cluster/external_mini_cluster.h: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h@483 PS1, Line 483: // Removes the master from the ExternalMiniCluster when the master has been removed : // dynamically after bringing up the ExternalMiniCluster. > nit: just reading this, a few things aren't immediately clear: Added a clarifying comment for #1. Will address the client part in a follow-up change. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc File src/kudu/mini-cluster/external_mini_cluster.cc: http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@964 PS1, Line 964: ) { > nit: Why not ++it here? Seems more idiomatic. It's better to not increment iterator here when iterating over a collection and deleting an entry as erase() returns the next iterator. Hence placed it below. However in this case we don't iterate once the desired entry is found and deleted, so I guess okay to move the iterator increment here. http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@966 PS1, Line 966: m > nit: if this is only used once, maybe just inline it as (*it)->bound_rpc_ho Done http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@972 PS1, Line 972: : r > nit: drop the extra line Done -- To view, visit http://gerrit.cloudera.org:8080/16936 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I76c03b8850faef60b65f85184c0a4db7cc6759ee Gerrit-Change-Number: 16936 Gerrit-PatchSet: 1 Gerrit-Owner: Bankim Bhavsar <[email protected]> Gerrit-Reviewer: Alexey Serbin <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Bankim Bhavsar <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Comment-Date: Mon, 25 Jan 2021 23:42:30 +0000 Gerrit-HasComments: Yes
