Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7440

to look at the new patch set (#5).

Change subject: disk failure: reassign failed tablets
......................................................................

disk failure: reassign failed tablets

Tablets put into the tablet::FAILED state are left in place until they
are manually deleted. This is a problem because failed tablets are never
evicted and reassigned (e.g. a tablet that fails to bootstrap will sit
there, responding to heartbeats and doing nothing else).

To remedy this, this patch changes the tserver response generated by
FAILED tablets to a new TABLET_FAILED error, which leaders ignore in
order to promote eviction of the failed replica.
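
As a rough illustration of why ignoring the response promotes eviction,
here is a minimal sketch. It assumes the leader evicts a peer once it
has gone too long without a successfully processed response; that
timeout mechanism and every name below (ResponseError, PeerState,
OnResponse, ShouldEvict) are hypothetical and are not taken from this
patch.

```cpp
#include <cstdint>

// Hypothetical stand-in for the error carried by a tserver response.
enum class ResponseError { kNone, kTabletFailed };

// Hypothetical per-peer bookkeeping kept by the leader.
struct PeerState {
  std::int64_t last_contact_sec;
};

// A TABLET_FAILED response is ignored: the peer's last-contact time is
// left untouched. Any other response refreshes it.
constexpr PeerState OnResponse(PeerState peer, ResponseError error,
                               std::int64_t now_sec) {
  return error == ResponseError::kTabletFailed ? peer : PeerState{now_sec};
}

// Assumed eviction rule: evict once the peer has been silent for longer
// than the timeout.
constexpr bool ShouldEvict(PeerState peer, std::int64_t now_sec,
                           std::int64_t timeout_sec) {
  return now_sec - peer.last_contact_sec > timeout_sec;
}

// A healthy peer that answered at t=100 is not evicted at t=120.
static_assert(!ShouldEvict(OnResponse(PeerState{0}, ResponseError::kNone, 100),
                           120, 60),
              "healthy peer stays");
// A FAILED peer that answered at t=100 still looks silent since t=0.
static_assert(ShouldEvict(OnResponse(PeerState{0},
                                     ResponseError::kTabletFailed, 100),
                          120, 60),
              "failed peer is evicted");
```

Under these assumptions a FAILED replica that keeps answering heartbeats
looks the same to the leader as an unreachable one.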

Additionally, a new tablet state is added: FAILED_AND_SHUTDOWN.  As with
QUIESCING and SHUTDOWN, TabletReplica::Shutdown() can wait on
FAILED_AND_SHUTDOWN. This is useful if a failed tablet needs to be shut
down and still needs to be reassigned. Calling normal Shutdown() cannot
leave the replica in the FAILED state, and the SHUTDOWN state cannot
itself indicate the need for eviction.
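
The relationship between these states can be sketched as follows. This
is only an illustration of the reasoning above, not the code in this
patch: the enum and the helpers StateAfterShutdown, IsShutDown and
NeedsEviction are hypothetical names, and the RUNNING value is assumed
for the sake of the example.

```cpp
// Hypothetical, simplified version of the replica states named above.
enum class ReplicaState {
  kRunning,
  kQuiescing,
  kFailed,
  kShutdown,
  kFailedAndShutdown,
};

// State a replica ends up in once shutdown completes. A failed replica
// keeps its "failed" marker instead of collapsing into plain SHUTDOWN.
constexpr ReplicaState StateAfterShutdown(ReplicaState before) {
  return (before == ReplicaState::kFailed ||
          before == ReplicaState::kFailedAndShutdown)
             ? ReplicaState::kFailedAndShutdown
             : ReplicaState::kShutdown;
}

// True for the states in which the replica has been shut down.
constexpr bool IsShutDown(ReplicaState s) {
  return s == ReplicaState::kShutdown ||
         s == ReplicaState::kFailedAndShutdown;
}

// True for the states that signal the replica should be evicted and
// reassigned. Plain SHUTDOWN carries no such signal.
constexpr bool NeedsEviction(ReplicaState s) {
  return s == ReplicaState::kFailed ||
         s == ReplicaState::kFailedAndShutdown;
}

static_assert(NeedsEviction(StateAfterShutdown(ReplicaState::kFailed)),
              "a failed replica still needs eviction after shutdown");
static_assert(!NeedsEviction(StateAfterShutdown(ReplicaState::kRunning)),
              "a normal shutdown does not request eviction");
```

The point of the extra state is that both facts, "shut down" and "needs
eviction", remain observable at the same time.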

Finally, prior to this patch, tablets were set to FAILED when they
failed to delete metadata. This is no longer the case. Error statuses
during deletion are only returned by IO to the metadata directory, and
the metadata directory is a single point of failure, so failures on
this codepath are made fatal for now. Making them fatal ensures that
tablets that were meant to be deleted are not reassigned. Once the
metadata directory is no longer a single point of failure, these
failures should be made benign, as proper error handling should make
files on the failed metadata directory unreachable.

This patch is a part of a series of patches to handle disk failure. See
section 2.5 in this doc:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit

Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
---
M src/kudu/client/scanner-internal.cc
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tablet/metadata.proto
M src/kudu/tablet/tablet_replica.cc
M src/kudu/tablet/tablet_replica.h
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/tserver.proto
12 files changed, 111 insertions(+), 55 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/7440/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7440
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5f61585b02fbe270d215bf7f49c0d390ceee3345
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <davidral...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
