Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/14223 )
Change subject: KUDU-2069 p5: recheck tablet health when exiting maintenance mode
......................................................................

KUDU-2069 p5: recheck tablet health when exiting maintenance mode

Previously, when exiting maintenance mode for a given tserver, if the replicas of that tserver were unhealthy, there was no mechanism with which to guarantee that the proper re-replication would happen. Specifically, the following sequence of events was possible:

1. tablet T has replicas on tservers A, B*, C
2. A enters maintenance mode
3. A is shut down
4. enough time passes for B* to consider A as failed
5. B* notices the failure of A and reports to the master that replica A has failed
6. the master does nothing to schedule re-replication because A is in maintenance mode
7. A exits maintenance mode, but is not brought back online
8. B* never hears back from A, and never hits a health state change to report to the master, and so the master never "rechecks" the health of T
9. T is left under-replicated with only B* and C

Note: The set of tservers we need to recheck is the set that hosted a leader of any replica on A.

This patch addresses this by requesting a full tablet report from every tserver in the cluster upon exiting maintenance mode on any tserver.
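For readers unfamiliar with the flow, the behavior described above can be sketched as a toy model. This is not the code from the patch: ToyMaster and every method on it are hypothetical names invented for illustration, and the sketch assumes the master keeps one "needs full tablet report" flag per tserver that is consumed the next time that tserver heartbeats.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy model only. ToyMaster and all of its methods are hypothetical and do
// not correspond to Kudu's actual classes or signatures.
class ToyMaster {
 public:
  // Start tracking a tserver; no full tablet report is pending initially.
  void RegisterTServer(const std::string& uuid) {
    needs_full_report_[uuid] = false;
  }

  void EnterMaintenanceMode(const std::string& uuid) {
    in_maintenance_.insert(uuid);
  }

  // Exiting maintenance mode on any tserver flags every registered tserver,
  // so that the health of all tablets gets rechecked (this avoids step 8 in
  // the sequence above, where no tserver ever reports again).
  // Assumption of this sketch: exiting when not in maintenance is a no-op.
  void ExitMaintenanceMode(const std::string& uuid) {
    if (in_maintenance_.erase(uuid) == 0) return;
    for (auto& entry : needs_full_report_) {
      entry.second = true;
    }
  }

  // Called while handling a heartbeat from `uuid`. Returns true if the
  // master should ask that tserver for a full tablet report, and clears
  // the flag so the request is made only once.
  bool ConsumeFullReportRequest(const std::string& uuid) {
    auto it = needs_full_report_.find(uuid);
    if (it == needs_full_report_.end()) return false;
    const bool requested = it->second;
    it->second = false;
    return requested;
  }

  // A failed replica on a tserver in maintenance mode is not re-replicated
  // (step 6 in the sequence above).
  bool ShouldReReplicate(const std::string& failed_uuid) const {
    return in_maintenance_.count(failed_uuid) == 0;
  }

 private:
  std::set<std::string> in_maintenance_;
  std::map<std::string, bool> needs_full_report_;
};
```

In this model, once A exits maintenance mode, the next heartbeats from B and C each trigger a full tablet report, which gives the master a chance to see that T is under-replicated even though no replica's health state changed. Flagging every tserver is simpler than working out which ones host leaders of tablets with a replica on A, at the cost of some extra report traffic.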
Testing:
- this adds to the existing integration test for maintenance mode to exercise the new behavior
- amends an existing concurrency test to verify the correct locking behavior is used

Change-Id: Ic0ab3d78cbc5b1228c01592a00118f11f76e43dd
Reviewed-on: http://gerrit.cloudera.org:8080/14223
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <[email protected]>
Reviewed-by: Alexey Serbin <[email protected]>
---
M src/kudu/integration-tests/maintenance_mode-itest.cc
M src/kudu/master/master_service.cc
M src/kudu/master/ts_descriptor.cc
M src/kudu/master/ts_descriptor.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/master/ts_state-test.cc
7 files changed, 74 insertions(+), 2 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, but someone else must approve
  Alexey Serbin: Looks good to me, approved

--
To view, visit http://gerrit.cloudera.org:8080/14223
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic0ab3d78cbc5b1228c01592a00118f11f76e43dd
Gerrit-Change-Number: 14223
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Hao Hao <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
