Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#8).

Change subject: disk failure: reassign failed tablets
......................................................................

disk failure: reassign failed tablets

Tablets that enter the tablet::FAILED state stay on the tserver until
they are manually deleted. This is a problem because failed tablets are
never evicted and reassigned (e.g. a tablet that fails to bootstrap will
sit there, responding to heartbeats and doing nothing else).

To remedy this, this patch changes the tserver response generated by
FAILED tablets to carry a new TABLET_FAILED error code, on which a
leader will immediately evict the peer.
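
As a rough illustration of the eviction rule described above, here is a
minimal, self-contained sketch. It is not the code in this patch: the
names PeerError, PeerResponse and ToyLeader are hypothetical stand-ins,
and the real logic lives in the consensus queue and peer classes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins; these are NOT Kudu's real types.
enum class PeerError { kNone, kTabletNotRunning, kTabletFailed };

struct PeerResponse {
  std::string peer_uuid;
  PeerError error;
};

// Toy leader that only records which peers it has decided to evict.
class ToyLeader {
 public:
  // Returns true if this response caused the peer to be queued for eviction.
  bool HandleResponse(const PeerResponse& resp) {
    assert(!resp.peer_uuid.empty());
    if (resp.error == PeerError::kTabletFailed) {
      // A failed replica will not recover on its own, so it is queued
      // for eviction as soon as the error is seen.
      to_evict_.push_back(resp.peer_uuid);
      return true;
    }
    // Any other response (including "not running") is left to the
    // leader's usual handling, which this sketch does not model.
    return false;
  }

  const std::vector<std::string>& to_evict() const { return to_evict_; }

 private:
  std::vector<std::string> to_evict_;
};
```

In this sketch a response carrying kTabletNotRunning leaves the eviction
list untouched, while one carrying kTabletFailed adds the peer to it.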

Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN. As with
QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on
FAILED_AND_SHUTDOWN. This is useful when a failed tablet needs to be
shut down but must still be reassigned: a normal Shutdown() cannot
leave the replica in the FAILED state, and the SHUTDOWN state by itself
cannot indicate the need for eviction.
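
The reasoning behind the extra state can be sketched as a small state
mapping. This is an assumption-laden illustration, not the
TabletReplica implementation: ReplicaState lists only a subset of
states, and StateAfterShutdown, NeedsEviction and IsShutDown are
hypothetical helpers.

```cpp
#include <cassert>

// Hypothetical subset of replica states, mirroring the names used above.
enum class ReplicaState {
  RUNNING,
  FAILED,
  QUIESCING,
  SHUTDOWN,
  FAILED_AND_SHUTDOWN,
};

// True for the terminal states a shutdown can end in.
inline bool IsShutDown(ReplicaState s) {
  return s == ReplicaState::SHUTDOWN ||
         s == ReplicaState::FAILED_AND_SHUTDOWN;
}

// True if the state still signals that the replica should be evicted.
inline bool NeedsEviction(ReplicaState s) {
  return s == ReplicaState::FAILED ||
         s == ReplicaState::FAILED_AND_SHUTDOWN;
}

// Terminal state reached when a shutdown starts from `before`.
// A failed replica keeps its "failed" marker across the shutdown.
inline ReplicaState StateAfterShutdown(ReplicaState before) {
  const ReplicaState after = NeedsEviction(before)
                                 ? ReplicaState::FAILED_AND_SHUTDOWN
                                 : ReplicaState::SHUTDOWN;
  assert(IsShutDown(after));
  return after;
}
```

With a single SHUTDOWN state, the information that the replica had
failed would be lost once it is shut down; the combined state keeps
both facts visible.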

Prior to this patch, tablets were set to FAILED when they failed to
delete metadata. This is no longer the case. Since error statuses during
deletion are only returned during IO to the metadata directory, and
because the metadata directory is a single point of failure, failures on
this codepath are made fatal for now. Once the metadata directory is no
longer a single point of failure,
these failures should be made benign, as proper error handling should
make files in the failed metadata directory unreachable. This ensures
the tablets that were meant to be deleted are not reassigned.

The test raft_consensus-itest is updated to ensure that failed tablets
are evicted and replaced. The test tablet_server-test is also updated to
ensure that, instead of TABLET_NOT_RUNNING, TABLET_FAILED is returned by
failed tablets.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/ts_recovery-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
14 files changed, 131 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/8
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
