Bankim Bhavsar has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/16936 )

Change subject: [master] KUDU-2181 Raft ChangeConfig request to remove a master
......................................................................


Patch Set 1:

(24 comments)

http://gerrit.cloudera.org:8080/#/c/16936/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16936/1//COMMIT_MSG@12
PS1, Line 12: tests
> I'm curious what's the behavior of Kudu client when a master is removed und
That's a good point and I guess applies to both addition and removal of masters 
but more so to removal of masters.
Let me address and looking into it as a follow-up change.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@254
PS1, Line 254: cluster
> nit: it would be great to document the 'cluster' parameter as well in the c
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@346
PS1, Line 346:       if (master != HostPort()) {
             :         *req.mutable_rpc_addr() = HostPortToPB(master);
             :       }
> What does it mean for this to be called with an unset 'master' field? Would
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@474
PS1, Line 474: TransferMasterLeadershipAndVerify
> nit: "verify" is a bit overloaded, especially when in a file that can misco
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@787
PS1, Line 787: // When an ExternalMiniCluster is restarted after removal of a 
master then one of the
             :   // remaining masters can get reassigned to the same wal dir 
which was previously assigned
             :   // to the removed master. This causes problems during 
verification, so we always try to
             :   // remove the last master in terms of index for test purposes.
> Does this problem go away if, rather than creating a new mini cluster, we j
ExternalMiniCluster interface doesn't allow updating the masters or more 
specifically the ExternalMiniClusterOptions on restart. Hence currently that's 
not an option.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@805
PS1, Line 805:   // We need at least 1 operation for the leader in current term 
before initiating ChangeConfig.
> I thought this happens by replicating a no-op when first becoming leader. O
I did hit following error during development of the test and now that I recall 
it was intermittent. So there is a possibility of a race when the no-op doesn't 
finish before initiating the ChangeConfig request.

"Leader has not yet committed an operation in its own term"
https://github.com/apache/kudu/blob/master/src/kudu/consensus/raft_consensus.cc#L1896

I'm not sure about a good way to wait/verify that the NO-OP operation has 
completed and it's safe to proceed with Remove master ChangeConfig.
Table creation helps with verification later so keeping this code as-is and 
added a comment.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@814
PS1, Line 814:    ASSERT_EVENTUALLY([&] {
             :       VerifyDeadMasterInSpecifiedState(master_to_remove_uuid, 
consensus::HealthReportPB::UNKNOWN);
             :     });
> It seems possible we'd trigger a test failure if, by luck, the gaps in ASSE
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@834
PS1, Line 834: ASSERT_FALSE(ContainsKey(actual_master_hps, master_to_remove));
> nit: while I appreciate the dedicated checking, especially when it comes to
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@853
PS1, Line 853:   cluster_.reset();
> Does
ExternalMiniCluster interface doesn't allow updating the masters or more 
specifically the ExternalMiniClusterOptions on restart.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@930
PS1, Line 930:  /* Omitting "--master_support_change_config" */
> nit: maybe just set it to false now so we don't have to update this test wh
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@938
PS1, Line 938: non_leader_master_idx
> Does it matter whether it's leader or non-leader master index?  I guess if
Good point. Added checks for both leader and non-leader masters.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@995
PS1, Line 995: It's possible there is a leadership change between querying for 
leader master and
             :   // sending the add master request to non-leader master and 
hence using ASSERT_EVENTUALLY.
> In tablet server tests there is another way to protect against this -- disa
Using higher value of --leader_failure_max_missed_heartbeat_periods to avoid 
leader master transfers.
I tried --enable_leader_failure_detection=false for masters but leads to issues 
in electing a master on cluster startup. I didn't dive deeper.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@1009
PS1, Line 1009:     
ASSERT_OK(cluster_->master_proxy(non_leader_master_idx)->RemoveMaster(req, 
&resp, &rpc));
> Yep, I'm curious whether this might end up in removing one master, and then
Scenario mentioned by Alexey would be possible.

Using higher value of --leader_failure_max_missed_heartbeat_periods to avoid 
leader master transfers.
I tried --enable_leader_failure_detection=false for masters but leads to issues 
in electing a master on cluster startup. I didn't dive deeper.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/dynamic_multi_master-test.cc@1019
PS1, Line 1019: TEST_F(DynamicMultiMasterTest, TestRemoveLeaderMaster) {
> Does it make sense to add a simple scenario to verify that a single and the
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.h
File src/kudu/master/master.h:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.h@131
PS1, Line 131: of
> nit: of initiating ?
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc
File src/kudu/master/master.cc:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@523
PS1, Line 523:     if (HostPortFromPB(peer.last_known_addr()) == hp
> Could there be more than one master satisfying this criterion?  If so, what
Updated the code to catch that condition and return an error back.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@532
PS1, Line 532: Substitute
> nit: nothing being substituted? Maybe add the leader's UUID?
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.cc@532
PS1, Line 532: leader master
> nit: I guess it might be not only the leader, right?  Maybe, just report th
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master.proto@941
PS1, Line 941: HostPortPB
> Why HostPostPB instead of UUID as an identifier of existing server is bette
For adding a master makes sense to use HostPortPB and removing better to use 
the unique id.
Changed to using uuid instead.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/master/master_service.cc@277
PS1, Line 277:   if (!FLAGS_master_support_change_config) {
             :     rpc->RespondFailure(Status::NotSupported("Removing master is 
not supported"));
             :     return;
             :   }
> Hrm, can you remind me why it's insufficient to rely on the SupportsFeature
Pasting the comment added.
  // This feature flag is part of SupportsFeature() function in master service 
but it's possible
  // that a client may not specify the feature flag.
  // This check protects access to the server feature irrespective of whether 
the
  // feature flag is specified in the client.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h
File src/kudu/mini-cluster/external_mini_cluster.h:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.h@483
PS1, Line 483:   // Removes the master from the ExternalMiniCluster when the 
master has been removed
             :   // dynamically after bringing up the ExternalMiniCluster.
> nit: just reading this, a few things aren't immediately clear:
Added a clarifying comment for #1. Will address the client part in a follow-up 
change.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc
File src/kudu/mini-cluster/external_mini_cluster.cc:

http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@964
PS1, Line 964: ) {
> nit: Why not ++it here? Seems more idiomatic.
It's better to not increment iterator here when iterating over a collection and 
deleting an entry as erase() returns the next iterator. Hence placed it below.
However in this case we don't iterate once the desired entry is found and 
deleted, so I guess okay to move the iterator increment here.


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@966
PS1, Line 966: m
> nit: if this is only used once, maybe just inline it as (*it)->bound_rpc_ho
Done


http://gerrit.cloudera.org:8080/#/c/16936/1/src/kudu/mini-cluster/external_mini_cluster.cc@972
PS1, Line 972:
             :   r
> nit: drop the extra line
Done



--
To view, visit http://gerrit.cloudera.org:8080/16936
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I76c03b8850faef60b65f85184c0a4db7cc6759ee
Gerrit-Change-Number: 16936
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Bankim Bhavsar <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 25 Jan 2021 23:42:30 +0000
Gerrit-HasComments: Yes

Reply via email to