Hello Kudu Jenkins, Adar Dembo,

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/15028

to look at the new patch set (#2).

Change subject: [tablet_copy] KUDU-2496: fail tablet when concurrent IO fails 
copy
......................................................................

[tablet_copy] KUDU-2496: fail tablet when concurrent IO fails copy

I saw a precommit failure of
TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure because the
number of replicas failed by the end of the test didn't converge to the
desired number.

Digging into this more, despite every tablet spanning every data
directory, a tablet that was being copied (and eventually failed to
copy) wasn't being marked as failed. This was caused by a race along the
lines of the following:

T1: Nears completion of copying tablet A.
T2: Begins to receive a copy of data for tablet B.
T1: Hits a disk failure on /data/1.
T1: Fails all the tablets in /data/1. While /data/1 is registered
    with tablet B, the replica for B is not yet registered.
T2: Registers tablet B with the tablet manager.
T2: The copy fails because tablet B is in a failed directory.
T2: Data for failed copy of B is cleaned up, but the replica is never
    marked as failed. Instead, it is never bootstrapped, and is left in
    the INITIALIZED state.
Note: the race doesn't need to be two tablet copies racing -- it could
be a copy and any other concurrent IO.

The fix is to ensure that tablet B marks itself as failed if its copy
fails, as we do elsewhere in the copy/bootstrap process.
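
For illustration only, the interleaving described above can be replayed
in a small toy model. This is a sketch, not the code in
ts_tablet_manager.cc: ToyTabletManager, ReplicaState, FailDir,
FinishCopy, ReplayRace, and the fail_replica_on_error flag are all
hypothetical names, and the real tablet manager is concurrent and far
more involved. The flag stands in for the error-path behavior that this
patch adds.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; these are not Kudu's classes.
enum class ReplicaState { INITIALIZED, RUNNING, FAILED };

struct ToyTabletManager {
  std::map<std::string, ReplicaState> replicas;  // registered replicas
  std::map<std::string, std::set<std::string>> dir_to_tablets;
  std::set<std::string> failed_dirs;

  // Record that a tablet has data in a directory. In the race, this
  // happens before the replica itself is registered.
  void AddTabletToDir(const std::string& dir, const std::string& tablet) {
    dir_to_tablets[dir].insert(tablet);
  }

  // Register a replica; it starts out INITIALIZED.
  void RegisterReplica(const std::string& tablet) {
    assert(replicas.count(tablet) == 0);
    replicas[tablet] = ReplicaState::INITIALIZED;
  }

  // Disk failure handler: fails only the replicas registered right now.
  void FailDir(const std::string& dir) {
    failed_dirs.insert(dir);
    for (const auto& tablet : dir_to_tablets[dir]) {
      auto it = replicas.find(tablet);
      if (it != replicas.end()) it->second = ReplicaState::FAILED;
    }
  }

  // Finish a copy. Returns false if the tablet has data in a failed
  // directory. With fail_replica_on_error set, the error path also
  // marks the replica FAILED instead of leaving it INITIALIZED.
  bool FinishCopy(const std::string& tablet, bool fail_replica_on_error) {
    for (const auto& dir : failed_dirs) {
      auto it = dir_to_tablets.find(dir);
      if (it != dir_to_tablets.end() && it->second.count(tablet) > 0) {
        if (fail_replica_on_error) {
          replicas.at(tablet) = ReplicaState::FAILED;
        }
        return false;
      }
    }
    replicas.at(tablet) = ReplicaState::RUNNING;
    return true;
  }
};

// Replays the T1/T2 interleaving and returns tablet B's final state.
ReplicaState ReplayRace(bool fail_replica_on_error) {
  ToyTabletManager m;
  m.RegisterReplica("A");
  m.AddTabletToDir("/data/1", "A");  // T1: copy of A is nearly done.
  m.AddTabletToDir("/data/1", "B");  // T2: begins receiving data for B.
  m.FailDir("/data/1");              // T1: disk failure; B is not registered.
  m.RegisterReplica("B");            // T2: registers B.
  m.FinishCopy("B", fail_replica_on_error);  // T2: the copy fails.
  return m.replicas.at("B");
}
```

In this model, ReplayRace(false) leaves B in INITIALIZED, which matches
the stuck replica described above, while ReplayRace(true) leaves it in
FAILED.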

I tweaked TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure
to expose this race, and it reproduced 2/1000 times. With this patch,
the test passed 5000/5000 times.

Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
---
M src/kudu/integration-tests/tablet_copy_client_session-itest.cc
M src/kudu/tserver/ts_tablet_manager.cc
2 files changed, 35 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/28/15028/2
--
To view, visit http://gerrit.cloudera.org:8080/15028
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
Gerrit-Change-Number: 15028
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Andrew Wong <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
