Dinesh Bhat has posted comments on this change. Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID ......................................................................
Patch Set 5: (7 comments) http://gerrit.cloudera.org:8080/#/c/5111/5//COMMIT_MSG Commit Message: Line 9: Fixes replica eviction from leader where a follower returning > Can you please explain your approach to fixing this issue in the commit mes Done, could you see updated version would suffice ? http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.cc File src/kudu/consensus/consensus_peers.cc: Line 238: switch (ps.response) { > Maybe add a default case that calls LOG(FATAL) on the (unknown) response ty Done http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.h File src/kudu/consensus/consensus_peers.h: PS5, Line 140: respond with appropriate response status. > Nit: technically, this method does not "respond" (in the sense of sending a yeah, good catch. fixed. http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_queue-test.cc File src/kudu/consensus/consensus_queue-test.cc: Line 816: // Verify that tablet copy is triggered when peer responds with > Why do we want tablet copy to be triggered when the peer responds that it i I copied this line-by-line from TestTriggerTabletCopyIfTabletNotFound above. In this case, WRONG_SERVER_UUID almost always means that for some reason tablet server went through reinitialization, which means all the tablets become inaccessible due to uuid mismatch for all RPCs via CheckUuidMatchOrRespond after the reinitialization. In some way, this is the same situation as TABLET_NOT_FOUND but impact is more because it affected all replicas on that node. As for why tablet copy is necessary, I think it's probably more relevant when the tablet becomes under-replicated. I added this test really for that 1-line change in consensus_queue.cc. http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/integration-tests/raft_consensus-itest.cc File src/kudu/integration-tests/raft_consensus-itest.cc: Line 2770 > Why was this comment removed? If it's no longer true, shouldn't the entire yeah, this line of code(and comment was supposed to be removed in next patch). I removed both comment+code in this patch itself initially, but added only code back realizing that this code is testing TABLET_NOT_RUNNING situation(forgot to add comment back). I think I am going to leave it without comments for now because my my next patch laid on top of this is removing both anyways, and it's a somewhat messy rebase situation otherwise. PS5, Line 2782: // These tests exhibit the replica eviction behaviors when a follower : // returns WRONG_SERVER_UUID error code. Tests verify that followers : // returning these error codes are evicted from consensus after a : // specified --follower_unavailable_considered_failed_sec. > Nit: this is talking about tests in the plural, but it's just a single test done, yeah I think I forgot to change comments when I split one patch into multiple patches. Line 2795: int leader_indx = -1; > Nit: conventionally, we use 'idx' as an infix for naming an index, not 'ind Done -- To view, visit http://gerrit.cloudera.org:8080/5111 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3 Gerrit-PatchSet: 5 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Adar Dembo <[email protected]> Gerrit-Reviewer: David Ribeiro Alves <[email protected]> Gerrit-Reviewer: Dinesh Bhat <[email protected]> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <[email protected]> Gerrit-Reviewer: Tidy Bot Gerrit-Reviewer: Todd Lipcon <[email protected]> Gerrit-HasComments: Yes
