Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15028 )
Change subject: [tablet_copy] KUDU-2496: fail tablet when concurrent IO fails copy
......................................................................
[tablet_copy] KUDU-2496: fail tablet when concurrent IO fails copy
I saw TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure fail
in precommit because the number of replicas failed by the end of the
test didn't converge to the desired number.
Digging into this more, despite every tablet spanning every data
directory, a tablet that was being copied (and eventually failed to
copy) wasn't being marked as failed. This was caused by a race along the
lines of the following:
T1: Nears completion of its copy of tablet A.
T2: Begins to receive a copy of data for tablet B.
T1: Hits a disk failure on /data/1.
T1: Fails all the tablets in /data/1. While /data/1 is registered
    with tablet B, the replica for B is not yet registered with the
    tablet manager, so it is skipped.
T2: Registers tablet B with the tablet manager.
T2: The copy fails because tablet B is in a failed directory.
T2: Data for the failed copy of B is cleaned up, but the replica is
    never marked as failed. Instead, it is never bootstrapped, and is
    left in the INITIALIZED state.
Note: the race doesn't need to be two tablet copies racing -- it could
be a copy and any other concurrent IO.
The fix is to ensure that the replica for tablet B is marked as failed
when its copy fails, as we do elsewhere in the copy/bootstrap process.
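
The interleaving above can be replayed deterministically with a toy
model. This is only a sketch under stated assumptions: ToyTabletManager,
ReplicaState, and ReplayRace are hypothetical stand-ins, not the actual
code in ts_tablet_manager.cc, and the model uses a single data directory
and no real threads. The fail_on_copy_error flag stands in for the fix.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical, simplified stand-ins; these are not Kudu's actual classes.
enum class ReplicaState { INITIALIZED, RUNNING, FAILED };

struct ToyTabletManager {
  std::map<std::string, ReplicaState> replicas;       // registered replicas
  std::map<std::string, std::set<std::string>> dirs;  // tablet id -> data dirs
  std::set<std::string> failed_dirs;

  // Disk-failure handler: only replicas that are already registered are failed.
  void FailTabletsInDir(const std::string& dir) {
    failed_dirs.insert(dir);
    for (auto& entry : replicas) {
      if (dirs[entry.first].count(dir) > 0) {
        entry.second = ReplicaState::FAILED;
      }
    }
  }

  void RegisterReplica(const std::string& id) {
    replicas[id] = ReplicaState::INITIALIZED;
  }

  // A copy cannot succeed if any directory the tablet uses has failed.
  bool CopySucceeds(const std::string& id) const {
    for (const auto& dir : dirs.at(id)) {
      if (failed_dirs.count(dir) > 0) return false;
    }
    return true;
  }
};

// Replays the T1/T2 interleaving and returns the final state of tablet B.
// fail_on_copy_error == true models the behavior with the fix applied.
ReplicaState ReplayRace(bool fail_on_copy_error) {
  ToyTabletManager mgr;
  mgr.dirs["A"] = {"/data/1"};
  mgr.RegisterReplica("A");         // T1: copy of A is nearly complete.
  mgr.dirs["B"] = {"/data/1"};      // T2: starts receiving data for B.
  mgr.FailTabletsInDir("/data/1");  // T1: disk failure; fails tablets in dir.
  assert(mgr.replicas.count("B") == 0);  // B was not registered, so skipped.
  mgr.RegisterReplica("B");         // T2: registers B with the manager.
  if (mgr.CopySucceeds("B")) {
    mgr.replicas["B"] = ReplicaState::RUNNING;
  } else if (fail_on_copy_error) {
    mgr.replicas["B"] = ReplicaState::FAILED;  // Replica fails itself.
  }
  // Without the fix, a failed copy leaves B in INITIALIZED.
  return mgr.replicas["B"];
}
```

In this model, ReplayRace(false) leaves B in INITIALIZED, which matches
the stuck replica described above, while ReplayRace(true) leaves it in
FAILED so the failed-replica count can converge.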
I tweaked TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure
to expose this race, and it reproduced in 2 of 1000 runs. With this
patch, the test passed in 5000 of 5000 runs.
Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
Reviewed-on: http://gerrit.cloudera.org:8080/15028
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 35 insertions(+), 33 deletions(-)
Approvals:
Adar Dembo: Looks good to me, approved
Kudu Jenkins: Verified
--
To view, visit http://gerrit.cloudera.org:8080/15028
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
Gerrit-Change-Number: 15028
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)