Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/14048 )
Change subject: KUDU-2915 Support to delete dead tservers from CLI
......................................................................


Patch Set 1:

(2 comments)

> Patch Set 1:
>
> Have we considered alternatives here? eg if the cluster is fully recovered
> from an outage, maybe we shouldn't have KSCK show a downed tablet server as a
> "bad" status?
>
> Another option that might be a bit better is to allow an option to forget a
> _specific_ dead TS -- otherwise it seems a bit heavy-handed to forget _all_
> dead tablet servers.

It seems I misunderstood the original intent of KUDU-2915. I thought it was the second option Todd mentioned, i.e. a way to forget a single dead tablet server. See https://docs.google.com/document/d/12BZqspGjHvQlc-o8XTDixoRol9Q36WJzXLJ6p15Zhf0/edit#heading=h.f7wecdcqsbe for my thoughts on how that would be useful in the context of tserver decommissioning.

I also chatted with Adar about this in person. One thing worth considering is that this is basically the last step of decommissioning. The full set of steps would look something like this:

1. Mark a tablet server as being decommissioned, so that no new replicas are placed onto it.
2. Move all replicas away from the tablet server.
3. Once it is empty, indicate (either automatically or with a tool) that the tablet server has been successfully decommissioned by having the master forget about it.

If we want to build towards this without introducing redundant tooling, we could skip adding a `kudu master delete_tserver` tool now and a `kudu cluster decommission` tool later. Instead, we could introduce the "delete single tserver" functionality as `kudu cluster decommission`, which would only do #3 for now. #1 and #2 would be left as future work that builds on top of this tool.

What do you think?
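The three decommissioning steps above could be modeled roughly as follows. This is a minimal C++ sketch under stated assumptions, not the patch's implementation. `TServerRegistry`, `TServerState`, `MarkDecommissioning()`, `MoveReplicaAway()` and `Forget()` are hypothetical names invented for illustration. They are not part of Kudu's master API, and the master's real tserver bookkeeping is more involved. The sketch only shows how step #3 could be layered on top of #1 and #2.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch only: these types do not exist in Kudu. They model the
// three decommissioning steps from the review comment above.
enum class TServerState { kLive, kDecommissioning };

struct TServerInfo {
  TServerState state = TServerState::kLive;
  int num_replicas = 0;
};

class TServerRegistry {
 public:
  // Registers a live tserver hosting the given number of replicas.
  void Register(const std::string& uuid, int num_replicas) {
    servers_[uuid] = TServerInfo{TServerState::kLive, num_replicas};
  }

  // Step #1: stop considering this tserver for new replica placement.
  void MarkDecommissioning(const std::string& uuid) {
    servers_.at(uuid).state = TServerState::kDecommissioning;
  }

  // Returns the UUIDs of tservers eligible for replica placement (live only).
  std::vector<std::string> PlacementCandidates() const {
    std::vector<std::string> out;
    for (const auto& entry : servers_) {
      if (entry.second.state == TServerState::kLive) out.push_back(entry.first);
    }
    return out;
  }

  // Step #2: models moving one replica off the tserver by decrementing a counter.
  void MoveReplicaAway(const std::string& uuid) {
    TServerInfo& info = servers_.at(uuid);
    if (info.num_replicas > 0) --info.num_replicas;
  }

  // Step #3: forget the tserver, but only if it is being decommissioned and
  // hosts no replicas. Returns true if the tserver was removed.
  bool Forget(const std::string& uuid) {
    auto it = servers_.find(uuid);
    if (it == servers_.end()) return false;
    if (it->second.state != TServerState::kDecommissioning ||
        it->second.num_replicas != 0) {
      return false;
    }
    servers_.erase(it);
    return true;
  }

  bool Contains(const std::string& uuid) const {
    return servers_.count(uuid) > 0;
  }

 private:
  std::map<std::string, TServerInfo> servers_;
};
```

In this sketch, `Forget()` checks both conditions, which is one possible design choice. Because of those checks, the forget step can never drop a tserver that is still eligible for placement or still hosts replicas. A tool that initially only exposes #3 would then keep the same contract once #1 and #2 are added.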
http://gerrit.cloudera.org:8080/#/c/14048/1/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/14048/1/src/kudu/master/master_service.cc@638
PS1, Line 638: void MasterServiceImpl::DeleteDeadTServerInMaster(const DeleteDeadTabletServersRequestPB* req,
The registration of a tablet server isn't persisted to disk. Is it important to take the leader lock when performing this action?


http://gerrit.cloudera.org:8080/#/c/14048/1/src/kudu/master/master_service.cc@650
PS1, Line 650: void MasterServiceImpl::DeleteDeadTabletServers(const DeleteDeadTabletServersRequestPB* req,
I don't see the advantage of having the masters relay these RPCs to each other. The tool could instead instantiate multiple master proxies and send the request to each master individually. Does "forgetting" a tablet server _need_ to go to the leader first?



-- 
To view, visit http://gerrit.cloudera.org:8080/14048
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I689cfb02a1ae44c4d941e83e3a8cf6e14c7911c7
Gerrit-Change-Number: 14048
Gerrit-PatchSet: 1
Gerrit-Owner: honeyhexin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <[email protected]>
Gerrit-Comment-Date: Mon, 12 Aug 2019 18:53:29 +0000
Gerrit-HasComments: Yes
