Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )
Change subject: KUDU-2915: add tool to remove a tablet server ...................................................................... Patch Set 2: (12 comments) http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto File src/kudu/master/master.proto: http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1076 PS2, Line 1076: required string uuid = 1; nit: we try to err away from using "required" in protobuf, given it's removed in proto3. http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1077 PS2, Line 1077: optional bool keep_tserver_state = 2; : optional bool force_remove_live_tserver = 3; nit: add a comment describing these args. Here's a start // Whether or not to remove any persisted tserver state (e.g. // MAINTENANCE_MODE). For this to take effect, the receiving master must be the // leader. // Whether to return an error in case the tserver is not presumed to be dead, // per --tserver_unresponsive_timeout_ms. http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc File src/kudu/master/master_service.cc: http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@339 PS2, Line 339: keep_tserver_state nit: maybe rename this to 'remove_tserver_state' so we don't have to keep dealing with the negative below? If so, could you also change the proto field name? http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@344 PS2, Line 344: if (!keep_tserver_state && !l.CheckIsInitializedOrRespond(resp, rpc)) { : return; : } : : Status s = server_->ts_manager()->RemoveTServer(ts_uuid, force_remove_live_tserver); : if (PREDICT_FALSE(!s.ok())) { : rpc->RespondFailure(s); : return; : } : : if (!keep_tserver_state && l.leader_status().ok()) { : s = server_->ts_manager()->SetTServerState( nit: may be easier to follow as: if (!l.CheckIsInitializedOrRespond(...)) { return; } Status s = ts_manager->RemoveTServer(...); if (!s.ok()) { ... } if (!keep_tserver_state) { if (l.leader_status().ok()) { ts_manager->SetTserverState(...); } } http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@358 PS2, Line 358: DONT_ALLOW_MISSING_TSERVER I'm curious what the rationale is behind not allowing a missing tserver. It seems like a nice property to have that calls to this RPC could be retriable and effectively idempotent. http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc File src/kudu/master/ts_manager.cc: http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc@326 PS2, Line 326: Tablet Server nit: let's try keeping usage of "tserver" vs "Tablet Server" vs "tablet server" homogenous at least without the same areas of code. Maybe just 'tserver $0 ..." here? http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc File src/kudu/tools/kudu-tool-test.cc: http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7472 PS2, Line 7472: opts.num_masters = 3; : opts.extra_master_flags.push_back( : Substitute("--tserver_unresponsive_timeout_ms=$0", kTserverUnresponsiveMs)); The --heartbeat_interval_ms default value is also 1s. Let's set it a bit higher on the tservers so we ensure the tserver becomes unresponsive http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7480 PS2, Line 7480: // Enter maintenance mode on the tserver, shutdown it and wait it become dead. nit: "shut it down" http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7483 PS2, Line 7483: SleepFor(MonoDelta::FromMilliseconds(kTserverUnresponsiveMs)); : : { : string out; : string err; : // Getting an error when running ksck and the output contains the dead tserver. : Status s = : RunActionStdoutStderrString(Substitute("cluster ksck $0", master_addrs_str), &out, &err); : ASSERT_TRUE(s.IsRuntimeError()); : ASSERT_STR_CONTAINS( : out, Substitute("$0 | $1 | UNAVAILABLE", ts_uuid, ts->bound_rpc_hostport().ToString())); : } nit: rather than sleeping and expecting the exact sleep to work, could we instead remove the sleep and wrap this in an ASSERT_EVENTUALLY()? Seems that would be less prone to flakiness http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7509 PS2, Line 7509: ASSERT_STR_CONTAINS(out, : Substitute("Tablet Server States\n" : " Server | State\n" : "----------------------------------+------------------\n" : " $0 | MAINTENANCE_MODE", : ts_uuid)); Maybe after this, could we restart the tserver with a very short heartbeat interval such that it is guaranteed to be alive, and then run 'kudu tserver remove -force_remove_live_tserver -keep_tserver_state=false'? Just so we have fuller test coverage. I would expect that after that, we'd see no MAINTENANCE_MODE message, but we should be able to see the ts_uuid in the ksck output http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7508 PS2, Line 7508: if (keep_tserver_state) { : ASSERT_STR_CONTAINS(out, : Substitute("Tablet Server States\n" : " Server | State\n" : "----------------------------------+------------------\n" : " $0 | MAINTENANCE_MODE", : ts_uuid)); : } In the else case, can we ASSERT_STR_NOT_CONTAINS("MAINTENANCE_MODE")? We should also be able to assert the ts_uuid isn't in ksck at all, right? http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7541 PS2, Line 7541: RunKuduTool({"tserver", "remove", master_addrs_str, ts_uuid, "-force_remove_live_tserver"})); Could we also set a high tserver heartbeat interval, and then check afterwards that the tserver doesn't show up in ksck? -- To view, visit http://gerrit.cloudera.org:8080/18124 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a Gerrit-Change-Number: 18124 Gerrit-PatchSet: 2 Gerrit-Owner: Yifan Zhang <[email protected]> Gerrit-Reviewer: Abhishek Chennaka <[email protected]> Gerrit-Reviewer: Andrew Wong <[email protected]> Gerrit-Reviewer: Kudu Jenkins (120) Gerrit-Reviewer: Yifan Zhang <[email protected]> Gerrit-Reviewer: Zoltan Chovan <[email protected]> Gerrit-Comment-Date: Fri, 14 Jan 2022 00:35:40 +0000 Gerrit-HasComments: Yes
