Hello Todd Lipcon, Kudu Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/7735
to look at the new patch set (#4).
Change subject: consensus: use periodic timers for failure detection
......................................................................
consensus: use periodic timers for failure detection
This patch replaces the existing failure detection (FD) with a new approach
built using periodic timers. The existing approach had a major drawback:
each failure monitor required a dedicated thread, and there was a monitor
for each replica.
The new approach "schedules" a failure into the future using the server's
reactor thread pool, "resetting" it when leader activity is detected.
There's an inherent semantic mismatch between dedicated threads that
periodically wake to check for failures and this new approach; I tried to
provide similar semantics as best I could.
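
The schedule-and-reset idea described above can be sketched roughly as
follows. This is a minimal illustration, not the code in this patch:
SnoozableFailureTimer, Poll, and FirstFailureTimeMs are hypothetical names,
and a hand-advanced clock stands in for the reactor so that the behavior is
deterministic.

```cpp
#include <chrono>
#include <functional>
#include <utility>

using Millis = std::chrono::milliseconds;

// Hypothetical stand-in for a timer-based failure detector. This is not
// Kudu's PeriodicTimer; the clock is passed in by hand so the behavior is
// deterministic and easy to follow.
class SnoozableFailureTimer {
 public:
  SnoozableFailureTimer(Millis period, std::function<void()> on_failure)
      : period_(period), on_failure_(std::move(on_failure)) {}

  // Schedules the first "failure" one full period from now.
  void Start(Millis now) { Start(now, period_); }

  // Same, but with an explicit initial delta for the first expiry.
  void Start(Millis now, Millis initial_delta) {
    deadline_ = now + initial_delta;
    running_ = true;
  }

  // Leader activity was seen: push the scheduled failure into the future.
  void Snooze(Millis now) {
    if (running_) deadline_ = now + period_;
  }

  // Called by the simulated event loop. Fires the callback if the deadline
  // has passed, then re-arms the timer for another period.
  void Poll(Millis now) {
    if (running_ && now >= deadline_) {
      deadline_ = now + period_;
      on_failure_();
    }
  }

 private:
  const Millis period_;
  const std::function<void()> on_failure_;
  Millis deadline_{0};
  bool running_ = false;
};

// Simulates a follower that hears from its leader every 50 ms until
// t = 200 ms, after which the leader goes silent. Returns the time (in ms)
// at which the failure callback first fires, or -1 if it never does.
inline long FirstFailureTimeMs() {
  long fired_at = -1;
  long now_ms = 0;
  SnoozableFailureTimer fd(Millis(100), [&] {
    if (fired_at < 0) fired_at = now_ms;
  });
  fd.Start(Millis(0));
  for (now_ms = 0; now_ms <= 500; now_ms += 10) {
    if (now_ms <= 200 && now_ms % 50 == 0) fd.Snooze(Millis(now_ms));
    fd.Poll(Millis(now_ms));
  }
  return fired_at;
}
```

In this simulation the last snooze happens at t = 200 ms with a 100 ms
period, so the callback first fires at t = 300 ms. No thread sleeps or wakes
per replica; the only per-replica state is a deadline.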
Things worth noting:
- Most importantly: some FD periods are now shorter. This is because the
existing implementation "double counted" failure periods when adding
backoff (once in LeaderElectionExpBackoffDeltaUnlocked, and once by virtue
of the failure period comparison made by the failure monitor). This seemed
accidental to me, so I didn't bother preserving that behavior.
- It's tough to "expire" an FD using timers. Luckily, this only happens in
RaftConsensus::Start, so by making PeriodicTimer::Start accept an optional
initial delta, we can begin FD with an early first deadline that provides
the desired "detect a failure immediately but not too quickly" semantic,
similar to how the dedicated failure monitor thread operates.
- ReportFailureDetected is now run on a shared reactor thread rather than a
dedicated failure monitor thread. Since StartElection performs IO, which
should not block a reactor thread, I thunked it onto the Raft thread pool.
- Timer operations cannot fail, so I removed the return values from the
various FD-related functions.
- I also consolidated the two SnoozeFailureDetector variants; I found that
this made it easier to look at all the call-sites.
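
To make the first note concrete, here is one reading of the "double
counting" as arithmetic. It is a sketch under stated assumptions, not Kudu
code: the function names are hypothetical, the snooze delta is assumed to
already include one failure period, and the old monitor is assumed to wait a
further failure period past the snoozed timestamp.

```cpp
#include <chrono>

using Millis = std::chrono::milliseconds;

// Old scheme (assumed): the snooze delta already contains one failure
// period, and the monitor thread then waits another failure period past
// the snoozed timestamp before declaring a failure.
inline Millis OldEffectiveDelay(Millis snooze_delta, Millis failure_period) {
  return snooze_delta + failure_period;
}

// New scheme (assumed): the timer fires as soon as the snooze delta
// elapses, so the failure period is counted only once.
inline Millis NewEffectiveDelay(Millis snooze_delta) {
  return snooze_delta;
}
```

With hypothetical values of a 1500 ms failure period and a 2000 ms snooze
delta, the old effective delay is 3500 ms and the new one is 2000 ms, which
is shorter by exactly one failure period.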
Change-Id: I8acdb44e12b975fda4a226aa784db95bc7b4e330
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
2 files changed, 111 insertions(+), 139 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/35/7735/4
--
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8acdb44e12b975fda4a226aa784db95bc7b4e330
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <[email protected]>
Gerrit-Reviewer: Adar Dembo <[email protected]>
Gerrit-Reviewer: Dan Burkert <[email protected]>
Gerrit-Reviewer: David Ribeiro Alves <[email protected]>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <[email protected]>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <[email protected]>