Andrew Wong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1076
PS2, Line 1076:   required string uuid = 1;
nit: we try to err away from using "required" in protobuf, given it's removed 
in proto3.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1077
PS2, Line 1077:   optional bool keep_tserver_state = 2;
              :   optional bool force_remove_live_tserver = 3;
nit: add a comment describing these args. Here's a start

// Whether or not to remove any persisted tserver state (e.g.
// MAINTENANCE_MODE). For this to take effect, the receiving master must be the
// leader.

// Whether to return an error in case the tserver is not presumed to be dead,
// per --tserver_unresponsive_timeout_ms.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@339
PS2, Line 339: keep_tserver_state
nit: maybe rename this to 'remove_tserver_state' so we don't have to keep 
dealing with the negative below? If so, could you also change the proto field 
name?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@344
PS2, Line 344: if (!keep_tserver_state && !l.CheckIsInitializedOrRespond(resp, 
rpc)) {
             :     return;
             :   }
             :
             :   Status s = server_->ts_manager()->RemoveTServer(ts_uuid, 
force_remove_live_tserver);
             :   if (PREDICT_FALSE(!s.ok())) {
             :     rpc->RespondFailure(s);
             :     return;
             :   }
             :
             :   if (!keep_tserver_state && l.leader_status().ok()) {
             :     s = server_->ts_manager()->SetTServerState(
nit: may be easier to follow as:

if (!l.CheckIsInitializedOrRespond(...)) {
  return;
}
Status s = ts_manager->RemoveTServer(...);
if (!s.ok()) {
  ...
}
if (!keep_tserver_state) {
  if (l.leader_status().ok()) {
    ts_manager->SetTserverState(...);
  }
}


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@358
PS2, Line 358: DONT_ALLOW_MISSING_TSERVER
I'm curious what the rationale is behind not allowing a missing tserver. It 
seems like a nice property to have that calls to this RPC could be retriable 
and effectively idempotent.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc@326
PS2, Line 326: Tablet Server
nit: let's try keeping usage of "tserver" vs "Tablet Server" vs "tablet server" 
homogenous at least without the same areas of code. Maybe just 'tserver $0 ..." 
here?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7472
PS2, Line 7472:   opts.num_masters = 3;
              :   opts.extra_master_flags.push_back(
              :       Substitute("--tserver_unresponsive_timeout_ms=$0", 
kTserverUnresponsiveMs));
The --heartbeat_interval_ms default value is also 1s. Let's set it a bit higher 
on the tservers so we ensure the tserver becomes unresponsive


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7480
PS2, Line 7480:   // Enter maintenance mode on the tserver, shutdown it and 
wait it become dead.
nit: "shut it down"


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7483
PS2, Line 7483: SleepFor(MonoDelta::FromMilliseconds(kTserverUnresponsiveMs));
              :
              :   {
              :     string out;
              :     string err;
              :     // Getting an error when running ksck and the output 
contains the dead tserver.
              :     Status s =
              :         RunActionStdoutStderrString(Substitute("cluster ksck 
$0", master_addrs_str), &out, &err);
              :     ASSERT_TRUE(s.IsRuntimeError());
              :     ASSERT_STR_CONTAINS(
              :         out, Substitute("$0 | $1 | UNAVAILABLE", ts_uuid, 
ts->bound_rpc_hostport().ToString()));
              :   }
nit: rather than sleeping and expecting the exact sleep to work, could we 
instead remove the sleep and wrap this in an ASSERT_EVENTUALLY()? Seems that 
would be less prone to flakiness


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7509
PS2, Line 7509:     ASSERT_STR_CONTAINS(out,
              :                         Substitute("Tablet Server States\n"
              :                                    "              Server        
      |      State\n"
              :                                    
"----------------------------------+------------------\n"
              :                                    " $0 | MAINTENANCE_MODE",
              :                                    ts_uuid));
Maybe after this, could we restart the tserver with a very short heartbeat 
interval such that it is guaranteed to be alive, and then run 'kudu tserver 
remove -force_remove_live_tserver -keep_tserver_state=false'? Just so we have 
fuller test coverage. I would expect that after that, we'd see no 
MAINTENANCE_MODE message, but we should be able to see the ts_uuid in the ksck 
output


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7508
PS2, Line 7508:  if (keep_tserver_state) {
              :     ASSERT_STR_CONTAINS(out,
              :                         Substitute("Tablet Server States\n"
              :                                    "              Server        
      |      State\n"
              :                                    
"----------------------------------+------------------\n"
              :                                    " $0 | MAINTENANCE_MODE",
              :                                    ts_uuid));
              :   }
In the else case, can we ASSERT_STR_NOT_CONTAINS("MAINTENANCE_MODE")? We should 
also be able to assert the ts_uuid isn't in ksck at all, right?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7541
PS2, Line 7541:       RunKuduTool({"tserver", "remove", master_addrs_str, 
ts_uuid, "-force_remove_live_tserver"}));
Could we also set a high tserver heartbeat interval, and then check afterwards 
that the tserver doesn't show up in ksck?



--
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 2
Gerrit-Owner: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <[email protected]>
Gerrit-Reviewer: Zoltan Chovan <[email protected]>
Gerrit-Comment-Date: Fri, 14 Jan 2022 00:35:40 +0000
Gerrit-HasComments: Yes

Reply via email to