[
https://issues.apache.org/jira/browse/KUDU-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15425113#comment-15425113
]
Todd Lipcon commented on KUDU-1408:
-----------------------------------
I chatted with the CockroachDB folks to see how they handle raft log retention
and got a pointer to their truncation computation code at:
https://github.com/cockroachdb/cockroach/blob/2504845195388d0e5405f061a149e76bfcfb9302/storage/raft_log_queue.go#L111
Their raft log is layered entirely separately from their WAL, so they don't
need to mix database replay concerns with Raft retention concerns. As a
result, their implementation is relatively simple to understand. The summary
of their strategy is:
- *Policy*:
-- they retain only the logs necessary to catch up the farthest-behind follower
-- *unless* that size is so large that they're retaining more logs than the
total size of the on-disk data (which likely includes their DB WALs)
--- the logic here is that, at that point, it's cheaper to send a fresh full
copy of the data than to keep catching the follower up via logs
--- in that case, they truncate up to the next-farthest-behind follower whose
catch-up would stay under the total on-disk size, or up to the committed
index, whichever is higher
- *Implementation*
-- only the leader runs the above policy
-- when it decides to run a truncation, it proposes a TruncateLog command as a
normal raft operation. The followers truncate their own log in response.
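The retention policy above can be sketched as a small Go function. This is my
own illustration, not CockroachDB's actual code: the function and parameter
names are invented, and it simplifies the fallback by truncating straight to
the committed index rather than stepping to the next-farthest follower:

```go
package main

import "fmt"

// computeTruncatableIndex returns the highest raft log index that is safe to
// truncate up to, under the policy sketched above: hold logs back for the
// farthest-behind follower, unless the retained log has grown larger than
// the on-disk data (at which point a full copy is cheaper than catch-up).
func computeTruncatableIndex(matchIndexes []uint64, committedIndex uint64,
	raftLogSize, onDiskDataSize int64) uint64 {
	// By default we may truncate everything up to the committed index.
	truncatableIndex := committedIndex
	if raftLogSize <= onDiskDataSize {
		// Logs are still cheap to keep: hold back for the slowest follower.
		for _, idx := range matchIndexes {
			if idx < truncatableIndex {
				truncatableIndex = idx
			}
		}
	}
	// Never truncate past the committed index.
	if truncatableIndex > committedIndex {
		truncatableIndex = committedIndex
	}
	return truncatableIndex
}

func main() {
	// Followers at indexes 5 and 9, committed at 10, small log:
	// the laggard at 5 holds back truncation. Prints 5.
	fmt.Println(computeTruncatableIndex([]uint64{5, 9}, 10, 1<<20, 64<<20))
	// Log larger than the on-disk data: give up on laggards and
	// truncate to the committed index. Prints 10.
	fmt.Println(computeTruncatableIndex([]uint64{5, 9}, 10, 128<<20, 64<<20))
}
```

Only the leader would run this; the resulting index would then be proposed as
a TruncateLog command through raft itself, so followers truncate at the same
point in the operation sequence.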
Essentially this is the first solution I proposed above, but it addresses the
"cons" by limiting how far a follower can lag based on the ratio of log size
to data size. It also has the advantage of avoiding any complex heuristics --
there are heuristics, sure, but easy-to-follow ones.
I think I'll try hacking something like this together and trying it on a
cluster to see if it improves stability under a heavy write workload.
> Adding a replica may never succeed if copying tablet takes longer than the
> log retention time
> ---------------------------------------------------------------------------------------------
>
> Key: KUDU-1408
> URL: https://issues.apache.org/jira/browse/KUDU-1408
> Project: Kudu
> Issue Type: Bug
> Components: consensus, tserver
> Affects Versions: 0.8.0
> Reporter: Todd Lipcon
> Priority: Critical
>
> Currently, while a remote bootstrap session is in progress, we anchor the
> logs from the time at which it started. However, as soon as the session
> finishes, we drop the anchor, and delete any logs. In the case where the
> tablet copy itself takes longer than the log retention period, we are likely
> to hit a scenario like the following:
> - TS A starts downloading from TS B. It plans to download segments 1-4 and
> adds an anchor.
> - TS B handles writes for 20 minutes, rolling the log many times (e.g. up to
> log segment 20)
> - TS A finishes downloading, and ends the remote bootstrap session
> - TS B no longer has an anchor, so GCs all logs 1-16.
> - TS A finishes opening the tablet it just copied, but immediately is unable
> to catch up (because it only has segments 1-4, but the leader only has 17-20)
> - TS B evicts TS A
> This loop will go on basically forever until the write workload stops on TS B.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)