[kudu-CR] disk failure: add persistent disk states

Andrew Wong (Code Review) Thu, 10 Aug 2017 17:40:18 -0700

Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

http://gerrit.cloudera.org:8080/7270

to look at the new patch set (#8).

Change subject: disk failure: add persistent disk states
......................................................................

disk failure: add persistent disk states

This patch adds a new version of the path set so failed disks are not
used the next time the server is run. This is needed, as data on a
failed disk may be partially written and thus should not be used.

To accomplish this, disk states, path names, and a timestamp are added
to the path set. Additionally, if any disks are failed when loading
instances from disk, an 'unhealthy' instance with no path set metadata
will be created instead of failing the startup process.

CheckIntegrity() is updated to accomodate this. Rather than
comparing all instances against an agreed-upon set of UUIDs, the single
path set with the largest timestamp is used as the main one to compare
against (if only old path sets are available, the first healthy one is
used). If no instances are healthy, CheckIntegrity() will fail, as all
disks are failed.

The main path set is checked to ensure its integrity (e.g. no
duplicates), after which it is upgraded with the extra metadata if
needed. It is then checked to ensure it is consistent with the healthy
instances (e.g. having the right UUIDs and paths).

Testing is done in a new iteration of CheckIntegrity(). End-to-end
testing is done in a later patch.

Some notes:
- In the case of a server restart where all healthy disks fail to start
up and some known failed disks start working again, the server will
successfully start up with the bad disks (may have partially-written
data).
- If there are any unhealthy instances when upgrading the path sets (i.e.
adding disk states, paths, timestamp), a complete mapping of UUIDs to
paths will not be available, and CheckIntegrity() will fail.
- The main path set's disk states are updated to reflect the failure of
the instances. Since data on a failed disk cannot be trusted, disks
that are successfully read from but are already marked FAILED by the
main path set are not marked HEALTHY.

This patch is a part of a series of patches to handle disk failures. To
see how this fits, see 2.6 in:
https://docs.google.com/document/d/1zZk-vb_ETKUuePcZ9ZqoSK2oPvAAaEV1sjDXes8Pxgk/edit?usp=sharing

Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
---
M src/kudu/fs/block_manager_util-test.cc
M src/kudu/fs/block_manager_util.cc
M src/kudu/fs/block_manager_util.h
M src/kudu/fs/data_dirs-test.cc
M src/kudu/fs/data_dirs.cc
M src/kudu/fs/data_dirs.h
M src/kudu/fs/fs.proto
M src/kudu/fs/log_block_manager-test.cc
M src/kudu/fs/log_block_manager.cc
9 files changed, 770 insertions(+), 195 deletions(-)

git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/70/7270/8
--
To view, visit http://gerrit.cloudera.org:8080/7270
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifddf0817fe1a82044077f5544c400c88de20769f
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>

[kudu-CR] disk failure: add persistent disk states

Reply via email to