Hello Todd Lipcon, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/6925
to look at the new patch set (#6).
Change subject: KUDU-2020: tserver failure causes multiple tablet copy
operations per under-replicated tablet
......................................................................
KUDU-2020: tserver failure causes multiple tablet copy operations per
under-replicated tablet
The 'active ingredient' in this patch is the change to
TsTabletManager::StartTabletCopy that causes an ALREADY_INPROGRESS
response to be returned whenever the tablet is currently being copied,
even if the tablet copy thread pool is full. Previously an
ALREADY_INPROGRESS response would only occur if the tablet was
currently being copied and the thread pool was not full.
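To illustrate the ordering change described above, here is a minimal
sketch. It is not the actual TsTabletManager code: CopyManager,
CopyResult, kThrottled, and both method names are hypothetical, and the
thread pool is modeled as a simple capacity check on the set of
in-flight copies.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical stand-ins for illustration; these are not real Kudu types.
enum class CopyResult { kStarted, kAlreadyInProgress, kThrottled };

class CopyManager {
 public:
  explicit CopyManager(std::size_t pool_capacity) : capacity_(pool_capacity) {
    assert(pool_capacity > 0);
  }

  // Old ordering: a full pool is reported first, which masks the fact
  // that this tablet is already being copied.
  CopyResult StartCopyOld(const std::string& tablet_id) {
    if (in_flight_.size() >= capacity_) return CopyResult::kThrottled;
    if (in_flight_.count(tablet_id) > 0) return CopyResult::kAlreadyInProgress;
    in_flight_.insert(tablet_id);
    return CopyResult::kStarted;
  }

  // New ordering: the in-progress check runs first, so the caller learns
  // that the copy is underway regardless of how full the pool is.
  CopyResult StartCopyNew(const std::string& tablet_id) {
    if (in_flight_.count(tablet_id) > 0) return CopyResult::kAlreadyInProgress;
    if (in_flight_.size() >= capacity_) return CopyResult::kThrottled;
    in_flight_.insert(tablet_id);
    return CopyResult::kStarted;
  }

 private:
  std::size_t capacity_;
  std::set<std::string> in_flight_;  // tablets currently being copied
};
```

With a pool of capacity 1 and one copy already running, the old
ordering reports the pool as full for a repeated request on the same
tablet, while the new ordering reports that the copy is already in
progress.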
The effect of the failure to return ALREADY_INPROGRESS was that a leader
would be much more likely to consider a tablet server failed and to
subsequently drop the replica from the Raft config. As a result, on a
highly loaded cluster, copies of the same tablet could be in progress on
many tablet servers at once, started 300 seconds apart.
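The leader-side consequence can be sketched roughly as follows. This is
an illustration under stated assumptions, not the consensus_peers.cc
logic: PeerHealth, Response, and the method names are hypothetical, and
the sketch assumes that the leader counts an ALREADY_INPROGRESS response
as a sign of life while any other error leaves the failure clock
running. The 300-second window is the figure quoted above.

```cpp
// Hypothetical model of a leader tracking one follower's health.
enum class Response { kOk, kAlreadyInProgress, kError };

struct PeerHealth {
  int last_healthy_sec = 0;  // time of the last response counted as healthy

  // Assumption: only plain errors fail to refresh the health timestamp.
  void Record(int now_sec, Response r) {
    if (r != Response::kError) last_healthy_sec = now_sec;
  }

  // The peer is considered failed once the window elapses with no
  // healthy response.
  bool ConsideredFailed(int now_sec, int timeout_sec = 300) const {
    return now_sec - last_healthy_sec >= timeout_sec;
  }
};
```

Under this model, a tablet server that keeps answering with generic
errors while its copy is running is considered failed after 300 seconds,
whereas one that answers ALREADY_INPROGRESS is not.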
The remaining changes return more specific errors out of the
tablet copy service, which aids in testing specific codepaths. One of
the existing tablet_copy-itest cases has been beefed up to cover this
specific regression. Without the changes outlined above, it fails with:
    ../../src/kudu/integration-tests/tablet_copy-itest.cc:961: Failure
    Expected: (num_inprogress) > (0), actual: 0 vs 0
which is exactly what we would expect; the tablet server is failing to
return ALREADY_INPROGRESS errors.
Anecdotally, this patch has improved TTR by 5-10x on highly loaded
clusters. It's still possible for tablets to be bounced around during
re-replication if the copying tablet server has a full RPC queue, or
it's unable to start the tablet copy for 300 seconds, but both of these
conditions indicate that it's probably best to drop that tserver and
retry on a (hopefully) less stressed server.
Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/integration-tests/cluster_itest_util.cc
M src/kudu/integration-tests/cluster_itest_util.h
M src/kudu/integration-tests/tablet_copy-itest.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/ts_tablet_manager.cc
6 files changed, 124 insertions(+), 42 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/6925/6
--
To view, visit http://gerrit.cloudera.org:8080/6925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Iffa1f0fec4e882beabfee6e0f2672096caccdf75
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dan Burkert <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>