Hello Mike Percy, Kudu Jenkins, I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/9452 to look at the new patch set (#6). Change subject: KUDU-2293 fix cleanup after failed copies ...................................................................... KUDU-2293 fix cleanup after failed copies Before, when a tablet server failed to tablet copy, Kudu would perform a best-effort cleanup of the partially-copied replica and leave the tablet tombstoned. If this tombstoning were to fail due to disk issues (e.g. out of space), Kudu would allow this and tablet would remain in TABLET_DATA_COPYING both in-memory and on-disk. This would lead to a CHECK failure if another tablet copy were started for the same replica, as the server would attempt to copy over a replica with data already marked TABLET_DATA_COPYING. This behavior arose from trying to balance two invariants: - keep on-disk state consistent with in-memory state when possible - when a tablet copy fails, leave it in as dead of a state as we can (i.e. TABLET_DATA_TOMBSTONED with no transitions in progress) This patch updates the Abort() logic to lean towards the latter invariant: if a tablet copy fails, at least its in-memory state will be set as TABLET_DATA_TOMBSTONED. This may not be true on-disk, but that's okay because either 1) the tablet server will eventually overwrite it via another tablet copy (at which time its data must _not_ be in the TABLET_DATA_COPYING state), or 2) the server will be restarted and the tablet will be tombstoned upon seeing a non-TABLET_DATA_READY state anyway. We use the tablet copy internal state machine to determine whether to do anything during Abort(). This wasn't always right before, since the state machine would only update at the completion of various methods (e.g. Start()), which didn't necessarily reflect whether or not clean up was necessary. This patch updates the state machine to be slightly more fine-grained, introducing new states kStarting and kFinishing that indicate that there exists state to clean up (even if, e.g., Start() has not completed successfully). A test is added to tablet_copy-itest that tests failures to copy, ensuring that the tablet is left in such an state that further copies are possible without crashing. The test uses EIO injection to fail the copies, but the logic is the same as if full disks were used instead. Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04 --- M src/kudu/integration-tests/tablet_copy-itest.cc M src/kudu/tablet/tablet_metadata.h M src/kudu/tserver/tablet_copy_client.cc M src/kudu/tserver/tablet_copy_client.h 4 files changed, 166 insertions(+), 22 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/9452/6 -- To view, visit http://gerrit.cloudera.org:8080/9452 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: kudu Gerrit-Branch: master Gerrit-MessageType: newpatchset Gerrit-Change-Id: I0459d484a3064aa2de392328e2910c9f6741be04 Gerrit-Change-Number: 9452 Gerrit-PatchSet: 6 Gerrit-Owner: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org>