Hello Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9452

to look at the new patch set (#7).

Change subject: KUDU-2293 fix cleanup after failed copies
......................................................................

KUDU-2293 fix cleanup after failed copies

Before, when a tablet server failed to tablet copy, Kudu would perform a
best-effort cleanup of the partially-copied replica and leave the tablet
tombstoned. If this tombstoning were to fail due to disk issues (e.g.
out of space), Kudu would allow this and tablet would remain in
TABLET_DATA_COPYING both in-memory and on-disk. This would lead to a
CHECK failure if another tablet copy were started for the same replica,
as the server would attempt to copy over a replica with data already
marked TABLET_DATA_COPYING.

This behavior arose from trying to balance two invariants:
- keep on-disk state consistent with in-memory state when possible
- when a tablet copy fails, leave it in as dead of a state as we can
  (i.e. TABLET_DATA_TOMBSTONED with no transitions in progress)

This patch updates the Abort() logic to lean towards the latter
invariant: if a tablet copy fails, at least its in-memory state will be
set as TABLET_DATA_TOMBSTONED. This may not be true on-disk, but that's
okay because either 1) the tablet server will eventually overwrite it
via another tablet copy (at which time its data must _not_ be in the
TABLET_DATA_COPYING state), or 2) the server will be restarted and the
tablet will be tombstoned upon seeing a non-TABLET_DATA_READY state
anyway.

We use the tablet copy internal state machine to determine whether to do
anything during Abort(). This wasn't always right before, since the
state machine would only update at the completion of various methods
(e.g. Start()), which didn't necessarily reflect whether or not clean
up was necessary. This patch updates the state machine to be slightly
more fine-grained, introducing new states kStarting and kFinishing that
indicate that there exists state to clean up (even if, e.g., Start() has
not completed successfully).

A test is added to tablet_copy-itest that tests failures to copy,
ensuring that the tablet is left in such an state that further copies
are possible without crashing. The test uses EIO injection to fail the
copies, but the logic is the same as if full disks were used instead.

Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04
---
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tablet/tablet_metadata.h
M src/kudu/tserver/tablet_copy_client.cc
M src/kudu/tserver/tablet_copy_client.h
4 files changed, 169 insertions(+), 22 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/9452/7
--
To view, visit http://gerrit.cloudera.org:8080/9452
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04
Gerrit-Change-Number: 9452
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>

Reply via email to