Hello Adar Dembo,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/15028
to review the following change.
Change subject: tablet_copy: fail tablet when concurrent IO fails copy
......................................................................
tablet_copy: fail tablet when concurrent IO fails copy
I saw a precommit run of
TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure fail
because the number of replicas failed by the end of the test didn't
converge to the desired number.
Digging into this more, despite every tablet spanning every data
directory, a tablet that was being copied (and eventually failed to
copy) wasn't being marked as failed. This was caused by a race along the
lines of the following:
T1: Nears completion of its copy of tablet A.
T2: Begins to receive a copy of data for tablet B.
T1: Hits a disk failure on /data/1.
T1: Fails all the tablets in /data/1. While /data/1 is registered
with tablet B, the replica for B is not yet registered.
T2: Registers tablet B with the tablet manager.
T2: The copy fails because tablet B is in a failed directory.
T2: Data for the failed copy of B is cleaned up, but the replica is
never marked as failed. Instead, it is never bootstrapped, and is left
in the INITIALIZED state.
Note: the race doesn't need to be two tablet copies racing -- it could
be a copy and any other concurrent IO.
The fix is to ensure that tablet B is marked as failed when its copy
fails, as we do elsewhere in the copy/bootstrap process.
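As a rough illustration of the interleaving above and of the fix, here
is a minimal single-threaded sketch that replays the T1/T2 steps in
order. ToyTabletManager, FailDir, RegisterReplica, FinishCopy, and the
fail_replica_on_error flag are hypothetical names made up for this
example; this is not the TSTabletManager API, and the real change is
the one in ts_tablet_manager.cc.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical replica states; a real tablet server tracks more than these.
enum class State { INITIALIZED, RUNNING, FAILED };

// Toy stand-in for a tablet manager (an assumption for illustration only).
struct ToyTabletManager {
  std::set<std::string> failed_dirs;               // data dirs that hit a disk failure
  std::map<std::string, std::string> tablet_dir;   // tablet id -> data dir it uses
  std::map<std::string, State> replicas;           // replicas registered so far

  // Disk-failure handling: only replicas that are already registered
  // can be marked FAILED here.
  void FailDir(const std::string& dir) {
    failed_dirs.insert(dir);
    for (auto& entry : replicas) {
      auto it = tablet_dir.find(entry.first);
      if (it != tablet_dir.end() && it->second == dir) {
        entry.second = State::FAILED;
      }
    }
  }

  void RegisterReplica(const std::string& id) {
    replicas[id] = State::INITIALIZED;
  }

  // Finishes a copy. Returns false if the copy failed because the tablet
  // lives in a failed directory. Cleanup of copied data is not modeled.
  bool FinishCopy(const std::string& id, bool fail_replica_on_error) {
    if (failed_dirs.count(tablet_dir.at(id)) > 0) {
      if (fail_replica_on_error) {
        // The behavior described by the fix: the replica fails itself.
        replicas[id] = State::FAILED;
      }
      return false;
    }
    replicas[id] = State::RUNNING;
    return true;
  }
};

// Replays the interleaving from the commit message and returns the final
// state of tablet B's replica.
State RunInterleaving(bool fail_replica_on_error) {
  ToyTabletManager mgr;
  mgr.tablet_dir["A"] = "/data/1";
  mgr.RegisterReplica("A");            // T1: copy of A is nearly complete
  mgr.tablet_dir["B"] = "/data/1";     // T2: begins receiving data for B
  mgr.FailDir("/data/1");              // T1: disk failure; B not yet registered
  mgr.RegisterReplica("B");            // T2: registers B with the manager
  mgr.FinishCopy("B", fail_replica_on_error);  // T2: copy fails
  return mgr.replicas.at("B");
}
```

In this toy model, RunInterleaving(false) leaves B in INITIALIZED
(the behavior before the patch), while RunInterleaving(true) leaves it
in FAILED, so the count of failed replicas can converge.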
I tweaked TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure
to make this race easier to hit, and it reproduced in 2/1000 runs. With
this patch, the test passed 5000/5000 times.
Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
---
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 35 insertions(+), 33 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/15028/1
--
To view, visit http://gerrit.cloudera.org:8080/15028
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
Gerrit-Change-Number: 15028
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>