Hello Mike Percy, Alexey Serbin,
I'd like you to do a code review. Please visit
http://gerrit.cloudera.org:8080/7337
to review the following change.
Change subject: raft_consensus-itest: fix failing
TestMemoryRemainsConstantDespiteTwoDeadFollowers
......................................................................
raft_consensus-itest: fix failing
TestMemoryRemainsConstantDespiteTwoDeadFollowers
This test started failing recently with a seemingly unrelated change to
the minicluster (e6758739a90adeb2e4d0c6cf76185cf90cc7d2b0). After a
couple of hours of debugging, the relationship is still a mystery.
However, looking at the failures with some extra tracing added, I
determined that the issue was roughly the following:
- the test sets up a scenario in which writes are expected to time out
by killing 2/3 replicas
- due to recent client changes by Alexey, the client now considers a
Timeout error on writes to mean that the server is dead, and marks it
as no longer a leader in the cache
- this causes the next write on other threads to go to the master to
look up the "new location", on the assumption that a leader election
has probably happened
- in fact, _all_ of the threads quickly go to the master to perform the
same lookup, exhausting the "master lookup permits" counter. Any excess
threads beyond that count back off and sleep
- the master lookups in TSAN builds can sometimes take tens of
milliseconds, especially in the GCE test environment which uses low
core-count workers. Thus, some of the lookups to the masters
themselves time out.
- those that do succeed, of course, return the same location information
as we had before
- any writers which happen to wake up from their back-offs and get new
location information then try to do a write, but it's quite likely that
before they get a chance to do so, another thread has already
experienced another timeout and marked the replica as bad.
Essentially, the test can get into a kind of spiral where it floods the
master with lookups, each of which takes longer than 50ms and thus
causes more timeouts, which cause more lookups, and so on.
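The feedback loop above can be sketched with a toy model. All names and
numbers here are illustrative assumptions for the sketch, not values
taken from the Kudu client:

```python
# Toy model of the timeout spiral described above. The permit count,
# latencies, and thread count are made up for illustration.

MASTER_LOOKUP_PERMITS = 4      # hypothetical cap on concurrent lookups
BASE_LOOKUP_MS = 20            # lookup latency with no contention
PER_WAITER_PENALTY_MS = 15     # extra latency per thread piled onto the master

def lookup_latency_ms(concurrent_lookups: int) -> int:
    """Latency grows as more threads hit the master at once."""
    return BASE_LOOKUP_MS + PER_WAITER_PENALTY_MS * concurrent_lookups

def spiral(num_writers: int, write_timeout_ms: int, rounds: int) -> list:
    """Count how many writers time out on their lookup each round.

    A writer whose previous write timed out re-resolves the tablet
    location; if that lookup itself exceeds the write timeout, the
    writer retries next round, keeping the master saturated.
    """
    retrying = num_writers          # every writer's first write times out
    history = []
    for _ in range(rounds):
        in_flight = min(retrying, MASTER_LOOKUP_PERMITS)
        latency = lookup_latency_ms(in_flight)
        timed_out = retrying if latency > write_timeout_ms else 0
        history.append(timed_out)
        retrying = timed_out        # successes leave; timeouts loop back
    return history

# With a 50ms timeout the spiral never drains:
print(spiral(num_writers=8, write_timeout_ms=50, rounds=5))   # [8, 8, 8, 8, 8]
# A 150ms timeout absorbs the contended lookup latency:
print(spiral(num_writers=8, write_timeout_ms=150, rounds=5))  # [0, 0, 0, 0, 0]
```

In this model the spiral is self-sustaining whenever the contended
lookup latency exceeds the write timeout, which matches the behavior
observed in the failing runs.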
As identified elsewhere, the metacache probably needs a pretty major
re-working to fix these sorts of problems, but this scenario is also
fairly contrived. So, this patch just bumps the timeout to 150ms instead
of 50ms, and changes the payload of the writes to be much larger, so the
desired backpressure kicks in faster.
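As a quick sanity check on the new value, the 60ms figure below is a
made-up stand-in for the "tens of milliseconds" a contended TSAN/GCE
lookup was observed to take, not a measured number:

```python
# Back-of-envelope check of the timeout bump. SLOW_LOOKUP_MS is an
# illustrative stand-in for a contended master lookup under TSAN on
# low-core GCE workers.
SLOW_LOOKUP_MS = 60

def write_can_complete(timeout_ms: int, lookup_ms: int = SLOW_LOOKUP_MS) -> bool:
    """True if a write's timeout leaves room for one slow location lookup."""
    return timeout_ms > lookup_ms

print(write_can_complete(50))    # False: the lookup alone blows the budget
print(write_can_complete(150))   # True: slack remains for the write itself
```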
Before this change, TSAN test runs on GCE were failing nearly 100% of
the time. With the change, the test passed 100/100 runs.
Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753
---
M src/kudu/integration-tests/raft_consensus-itest.cc
1 file changed, 2 insertions(+), 1 deletion(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/37/7337/1
--
To view, visit http://gerrit.cloudera.org:8080/7337
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifa7d3d7655c2ecf376e894b8a1412e2fe3df0753
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <[email protected]>
Gerrit-Reviewer: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Mike Percy <[email protected]>