Mike Percy has submitted this change and it was merged.

Change subject: KUDU-1407: reassign failed tablets
......................................................................

KUDU-1407: reassign failed tablets

Tablets put into the state tablet::FAILED are left until they are manually deleted; they are not evicted and reassigned. If a tablet fails to bootstrap, it will sit, responding to heartbeats, doing nothing else.

This patch ensures failed tablets will be reassigned. As the tablets are not used, rather than directly setting replicas to FAILED, an error is first recorded and then TabletReplica::Shutdown() is called, leaving the final state as FAILED. A replica can no longer leave the FAILED state (calls to Shutdown() leave it FAILED).

The tserver response generated by FAILED tablets is now TABLET_FAILED. Upon receiving this, a leader will immediately evict the peer.

Prior to this patch, a tablet was marked FAILED if its WAL or metadata failed to delete (after already shutting down). If this occurs, there is no guarantee that the tablet's metadata reflects the deleted state. This failure is now fatal.

Testing is done in a few places:
- raft_consensus-itest is updated to ensure that tablets that fail to bootstrap are evicted and replaced.
- tablet_server-test is also updated to ensure that failed tablets return TABLET_FAILED instead of TABLET_NOT_RUNNING.
- a test is added to ts_tablet_manager-itest to verify that a tablet that is manually failed while running is evicted and replaced.

This patch is a part of a series of patches to handle disk failure.
See section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Reviewed-on: http://gerrit.cloudera.org:8080/7440
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <[email protected]>
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/integration-tests/ts_tablet_manager-itest.cc
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
13 files changed, 154 insertions(+), 58 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Kudu Jenkins: Verified

--
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>
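
For readers skimming the commit message, the behavior it describes (record an error, call Shutdown(), end in a terminal FAILED state, answer with TABLET_FAILED, and have the leader evict the peer) can be sketched roughly as follows. This is a simplified illustration and not the code from the patch: ReplicaSketch, SetError(), CheckRunning(), LeaderShouldEvict(), and both enums are hypothetical stand-ins for the real TabletReplica, tserver, and consensus-queue logic. The assumption that a cleanly shut down replica answers TABLET_NOT_RUNNING is also ours.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Kudu types.
enum class ReplicaState { RUNNING, SHUTDOWN, FAILED };
enum class TsErrorCode { OK, TABLET_NOT_RUNNING, TABLET_FAILED };

class ReplicaSketch {
 public:
  // Record an error first; the state itself changes only in Shutdown().
  void SetError(const std::string& msg) { error_ = msg; }

  // Shutdown() ends in FAILED if an error was recorded, else SHUTDOWN.
  // FAILED is terminal: further Shutdown() calls leave it FAILED.
  void Shutdown() {
    if (state_ == ReplicaState::FAILED) return;
    state_ = error_.empty() ? ReplicaState::SHUTDOWN : ReplicaState::FAILED;
  }

  ReplicaState state() const { return state_; }

  // Error code a tablet server might return for a request to this replica.
  TsErrorCode CheckRunning() const {
    switch (state_) {
      case ReplicaState::RUNNING: return TsErrorCode::OK;
      case ReplicaState::FAILED:  return TsErrorCode::TABLET_FAILED;
      default:                    return TsErrorCode::TABLET_NOT_RUNNING;
    }
  }

 private:
  ReplicaState state_ = ReplicaState::RUNNING;
  std::string error_;
};

// Leader-side decision: evict the peer immediately on TABLET_FAILED.
bool LeaderShouldEvict(TsErrorCode code) {
  return code == TsErrorCode::TABLET_FAILED;
}

// Walks through the failure path from the commit message.
bool DemoFailedReplicaIsEvicted() {
  ReplicaSketch replica;
  replica.SetError("bootstrap failed");  // error recorded first
  replica.Shutdown();                    // final state becomes FAILED
  replica.Shutdown();                    // still FAILED
  assert(replica.state() == ReplicaState::FAILED);
  return LeaderShouldEvict(replica.CheckRunning());
}
```

In this sketch the error is stored separately from the state so that Shutdown() is the only place where the state transitions, which mirrors the commit message's point that replicas are not set to FAILED directly.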
