[ https://issues.apache.org/jira/browse/KUDU-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014871#comment-17014871 ]
ASF subversion and git services commented on KUDU-2496:
-------------------------------------------------------
Commit 0d7ce6906d42a17a7cfabc958e672ddc39e9ea7b in kudu's branch
refs/heads/master from Andrew Wong
[ https://gitbox.apache.org/repos/asf?p=kudu.git;h=0d7ce69 ]
[tablet_copy] KUDU-2496: fail tablet when concurrent IO fails copy
I saw TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure fail
in a precommit run because the number of replicas failed by the end of
the test didn't converge to the desired number.
Digging into this more, despite every tablet spanning every data
directory, a tablet that was being copied (and eventually failed to
copy) wasn't being marked as failed. This was caused by a race along the
lines of the following:
T1: Nears completion to copy tablet A.
T2: Begins to receive a copy of data for tablet B.
T1: Hits a disk failure on /data/1.
T1: Fails all the tablets in /data/1. While /data/1 is registered
with tablet B, the replica for B is not yet registered.
T2: Registers tablet B with the tablet manager.
T2: The copy fails because tablet B is in a failed directory.
T2: Data for failed copy of B is cleaned up, but the replica is never
marked as failed. Instead, it is never bootstrapped, and is left in
the INITIALIZED state.
Note: the race doesn't need to be two tablet copies racing -- it could
be a copy and any other concurrent IO.
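
The interleaving above can be replayed deterministically in a single
thread. The following is a minimal sketch, not Kudu's actual code:
ToyTabletManager, FailTabletsInDir, RegisterReplica, and FinishCopy are
hypothetical stand-ins chosen for illustration. It only assumes what the
timeline states, namely that the disk-failure handler can fail just the
replicas that are registered at the moment it runs.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; these are not Kudu's real classes.
enum class ReplicaState { INITIALIZED, RUNNING, FAILED };

struct ToyTabletManager {
  std::map<std::string, ReplicaState> replicas;         // registered replicas
  std::map<std::string, std::set<std::string>> by_dir;  // dir -> tablet ids
  std::set<std::string> failed_dirs;

  // Disk-failure handler: fails only replicas that are registered right now.
  void FailTabletsInDir(const std::string& dir) {
    failed_dirs.insert(dir);
    for (const auto& id : by_dir[dir]) {
      auto it = replicas.find(id);
      if (it != replicas.end()) it->second = ReplicaState::FAILED;
    }
  }

  void RegisterReplica(const std::string& id) {
    replicas[id] = ReplicaState::INITIALIZED;
  }

  // Returns false, leaving the state untouched, if the tablet uses a failed dir.
  bool FinishCopy(const std::string& id) {
    for (const auto& dir : failed_dirs) {
      if (by_dir[dir].count(id) > 0) return false;
    }
    replicas[id] = ReplicaState::RUNNING;
    return true;
  }
};

// Replays the T1/T2 interleaving and returns the final state of tablet B.
ReplicaState ReplayRace() {
  ToyTabletManager mgr;
  mgr.by_dir["/data/1"] = {"A", "B"};  // T2 has begun receiving data for B
  mgr.RegisterReplica("A");            // T1's tablet is already registered
  mgr.FailTabletsInDir("/data/1");     // T1: disk failure; B is not registered
  mgr.RegisterReplica("B");            // T2: registers B after the handler ran
  bool copied = mgr.FinishCopy("B");   // T2: copy fails on the failed directory
  (void)copied;                        // the buggy path ignores the failure
  return mgr.replicas["B"];
}
```

In this sketch tablet A ends up FAILED, while tablet B stays INITIALIZED
because nothing marks it after the handler has already run.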
The fix is to ensure that tablet B marks itself as failed when its copy
fails, as we do elsewhere in the copy/bootstrap process.
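
The shape of that fix can be sketched as follows. This is an
illustration under assumptions rather than the patch itself: ToyReplica,
CopyData, and RunCopyAndFailOnError are hypothetical names, and the
dir_failed flag stands in for whatever IO error the real copy would hit.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for illustration; these are not Kudu's real classes.
enum class CopyReplicaState { INITIALIZED, RUNNING, FAILED };

struct ToyReplica {
  std::string tablet_id;
  CopyReplicaState state = CopyReplicaState::INITIALIZED;
};

// Hypothetical copy step: succeeds unless the replica's directory has failed.
bool CopyData(bool dir_failed) { return !dir_failed; }

// Runs the copy and, on failure, marks the replica FAILED before returning,
// so it is never left sitting in INITIALIZED.
bool RunCopyAndFailOnError(ToyReplica* replica, bool dir_failed) {
  if (!CopyData(dir_failed)) {
    replica->state = CopyReplicaState::FAILED;  // the replica fails itself
    return false;
  }
  replica->state = CopyReplicaState::RUNNING;
  return true;
}
```

Because the failing copy path sets the state itself, the outcome no
longer depends on whether the disk-failure handler ran before or after
the replica was registered.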
I tweaked TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure
to expose this race, and it reproduced in 2 of 1000 runs. With this
patch, the test passed 5000/5000 times.
Change-Id: Ie270f435174fb8fba2adea21a5fbb48f3e56e5cb
Reviewed-on: http://gerrit.cloudera.org:8080/15028
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
> TabletCopyClientSessionITest.TestStopCopyOnClientDiskFailure is flaky
> ---------------------------------------------------------------------
>
> Key: KUDU-2496
> URL: https://issues.apache.org/jira/browse/KUDU-2496
> Project: Kudu
> Issue Type: Bug
> Components: test
> Affects Versions: 1.8.0
> Reporter: Adar Dembo
> Assignee: Andrew Wong
> Priority: Major
> Attachments: tablet_copy_client_session-itest.txt
>
>
> This test failed in a pre-commit. I'm attaching the full test log.
> {noformat}
> I0710 23:26:48.405045 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:49.406461 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:50.407891 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:51.409301 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:52.410727 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:53.412380 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:54.413789 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:55.415179 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:56.416596 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:57.418123 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:58.419469 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:26:59.420924 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:00.422327 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:01.423743 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:02.425096 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:03.426508 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:04.427959 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:05.429391 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:06.430845 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:07.432289 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:08.433732 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:09.435497 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:10.436923 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:11.438374 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:12.439798 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:13.441087 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:14.442431 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:15.443853 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> I0710 23:27:16.445246 18604 tablet_copy_client_session-itest.cc:417] Waiting
> for tablets to fail: 6 / 10
> /home/jenkins-slave/workspace/kudu-master/1/src/kudu/integration-tests/tablet_copy_client_session-itest.cc:417:
> Failure
> Expected: (failed_on_ts) >= (kNumTablets - 1), actual: 6 vs 9
> /home/jenkins-slave/workspace/kudu-master/1/src/kudu/util/test_util.cc:322:
> Failure
> Failed
> Timed out waiting for assertion to pass.{noformat}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)