Adar Dembo has posted comments on this change. Change subject: KUDU-1337. Avoid spurious remote bootstraps on DeleteTablet() ......................................................................
Patch Set 1: (10 comments) http://gerrit.cloudera.org:8080/#/c/2436/1//COMMIT_MSG Commit Message: Line 13: the leader processes the RPC call sending the DeleteTablet() call to it, Nit: the language here is a little confusing. How about: When a table is deleted, the master sends a DeleteTablet() RPC to every replica of every tablet with the TABLET_DATA_DELETED parameter, which indicates a "permanent" tablet deletion. If a follower services DeleteTablet() before the leader does, it's possible for the leader to react to the missing replica by asking the follower to remote bootstrap itself. http://gerrit.cloudera.org:8080/#/c/2436/1/src/kudu/integration-tests/remote_bootstrap-itest.cc File src/kudu/integration-tests/remote_bootstrap-itest.cc: Line 743: // a follower tablet is deleted and the leader notices, before it has processed Nit: I don't think you need the comma after 'notices'. Line 745: TEST_F(RemoteBootstrapITest, TestRemoteBootstrappingDeletedTabletFails) { Neat test, though I was expecting something a little less synthetic: 1. Start a three node cluster, preferably with a low raft consensus heartbeat interval so that there's a higher chance of the leader "noticing" the lack of a replica on a follower. 2. Create a table. 3. Wait for the table to be created. 4. Wait for a leader to be elected (can be skipped if this is hard). 5. Delete the table. 6. Verify using remote bootstrap metrics (if they don't exist, they should) that no remote bootstraps were started. It's not guaranteed to fail before your fix every time (since you can't control the order in which replicas process DeleteTablet()) but I like it because it's more like a real workload and relies less on complex test methods. What do you think? Perhaps it could supplement this test? Line 775: workload.Start(); Does the test actually need rows to be inserted? Isn't the presence of created tablets sufficient? Line 788: cluster_->tablet_server(1)->uuid(), For completeness, how about you test a remote bootstrap on both followers? Line 790: 0, // Say I'm from term 0. Nit: not clear _why_ you're passing 0 for the term value. Line 799: ASSERT_OK(itest::StartRemoteBootstrap(leader, tablet_id, Both followers here too. Line 804: ASSERT_OK(WaitForServersToAgree(kTimeout, ts_map_, tablet_id, workload.batches_completed() + 1)); Why is this last call needed? http://gerrit.cloudera.org:8080/#/c/2436/1/src/kudu/tserver/ts_tablet_manager.h File src/kudu/tserver/ts_tablet_manager.h: Line 319: // transition_in_progress_. Update this. Line 328: std::unordered_set<string> perm_deleted_tablet_ids_; This set grows unbounded, do we care? -- To view, visit http://gerrit.cloudera.org:8080/2436 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I852f8ac70e1f6496127598e5e02de5b72711ab2b Gerrit-PatchSet: 1 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Mike Percy <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
