On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <[email protected]> wrote:
> Hi Todd,
>
> While looking at KUDU-1408, which addresses the catch-up game between
> bootstrapping a new replica and log retention on existing replicas, a few
> basic questions popped up in my mind:
> Looking at the description of the LogAnchor class, what is the purpose of
> a log anchor generally?

We use the term 'anchor' to mean preventing the garbage collection of old log segments. The main reasons we anchor logs are currently:

1- Durability: if there is some data that was inserted into a memory store (MRS or DMS) which has not yet been flushed, we need to retain the log segment which has the original operation so that we can replay it.

2- Time: we respect a configuration --log_min_seconds_to_retain and always retain any log segment that was closed within that time bound. This is useful so that, if a follower replica crashes or freezes temporarily and then comes back within this time period, the leader will still have enough back history of logs to "catch up" the follower.

3- Number of segments: a configuration --log_min_segments_to_retain determines how many segments of WAL we keep regardless of the above concerns. This is useful for things like debugging, and to prevent some edge cases in the code that might occur if there are 0 WALs for a tablet.

Log anchors are primarily concerned with item #1 above (durability). They allow a memory store to register an "anchor" corresponding to the operation index of the first data item that mutated it, and that anchor is kept alive until the store is durably flushed. The registry keeps the set of live anchors so that it can quickly be interrogated during log GC.

> Also, is this any different from rejoining a stale replica to the cluster?
> (stale could mean going as far back as starting a new replica?)

Not sure exactly what you're asking here... maybe the above answers it? The time-based retention's main goal is to keep some "back history" of each tablet so that, if a replica is only a little bit stale, it can restart and catch up from the leader's logs. If the leader has already GCed the logs necessary to catch up the replica (e.g. the replica was down for too long), then it will have to be evicted and replaced by a new replica. In that case, the new replica will start its life using the 'tablet copy' mechanism to transfer a snapshot of one of the existing replicas. The snapshot contains on-disk state as well as the WALs necessary to replay from that state. You can think of it like an rsync from a good replica.

> Instead of sending anchor logs via heartbeats (as you seem to be suggesting
> in the bug), could we not rely on the follower to suggest to the leader
> where it wants to replay the logs from? In other words, the 'tablet copy'
> begins with the leader asking the follower what its last known index was.
> I am confident I am missing several pieces here.

The thing is that replication is leader-driven, and you may have a leader re-election at any point. So, if you have three nodes A, B, and C, with A as the leader and C as a laggy replica, we want to ensure that neither A *nor B* GCs the logs that C may need to catch up. Currently there is no propagation of information about how far behind C is back up to the WAL system, and both A and B will happily GC logs based on the factors mentioned up top in this email.
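To make those retention factors concrete, here's a minimal C++ sketch of how a GC decision might combine them. All the names here (AnchorRegistry, Segment, CountGCableSegments) are made up for illustration and aren't our actual classes; it just shows how the three rules interact:

#include <algorithm>
#include <cstdint>
#include <ctime>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Registry of live anchors: each unflushed memory store registers the
// index of the first operation that mutated it, and unregisters once
// it has been durably flushed.
class AnchorRegistry {
 public:
  void Register(const std::string& owner, int64_t op_index) {
    anchors_[owner] = op_index;
  }
  void Unregister(const std::string& owner) {
    anchors_.erase(owner);
  }
  // Earliest anchored op index, or INT64_MAX if nothing is anchored.
  int64_t EarliestAnchoredIndex() const {
    int64_t min_index = std::numeric_limits<int64_t>::max();
    for (const auto& entry : anchors_) {
      min_index = std::min(min_index, entry.second);
    }
    return min_index;
  }

 private:
  std::map<std::string, int64_t> anchors_;  // owner -> first dirty op index
};

// A closed WAL segment: the highest op index it contains and the time
// at which it was closed.
struct Segment {
  int64_t max_op_index;
  time_t close_time;
};

// Returns how many of the oldest segments may be GCed, applying the
// three retention rules from above: (1) durability anchors,
// (2) --log_min_seconds_to_retain, (3) --log_min_segments_to_retain.
size_t CountGCableSegments(const std::vector<Segment>& segments,  // oldest first
                           const AnchorRegistry& anchors,
                           int min_seconds_to_retain,
                           size_t min_segments_to_retain,
                           time_t now) {
  size_t gcable = 0;
  for (size_t i = 0; i < segments.size(); i++) {
    // Rule 3: always keep at least min_segments_to_retain segments.
    if (segments.size() - i <= min_segments_to_retain) break;
    // Rule 2: keep any segment closed within the retention window.
    if (now - segments[i].close_time < min_seconds_to_retain) break;
    // Rule 1: keep any segment that may still hold an unflushed op.
    if (segments[i].max_op_index >= anchors.EarliestAnchoredIndex()) break;
    gcable++;
  }
  return gcable;
}

The key point for KUDU-1408 is that nothing in this decision knows anything about how far behind a remote follower is.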
One of the suggestions in KUDU-1408 is to (1) propagate information about the lagging replica's last index back to the leader, to prevent GC on the leader, and (2) have the leader propagate that information to the other followers (like 'B' in this example) so that they also don't GC those log segments too early.
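Speculatively, the leader-side bookkeeping for (1) and (2) might look something like the following. None of this is implemented, and the names (LeaderRetentionTracker, RetentionWatermark) are hypothetical; the idea is that the leader tracks how far each peer has replicated, derives a retention watermark from the minimum across peers, applies it to its own GC decision, and piggybacks it on its heartbeats so followers can fold it into theirs:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Leader-side bookkeeping: the last op index each peer has replicated.
class LeaderRetentionTracker {
 public:
  // Called when a peer acks replication up to 'index' (or when the
  // leader learns a peer's last-received index from a response).
  void UpdatePeerIndex(const std::string& peer_uuid, int64_t index) {
    last_replicated_[peer_uuid] = index;
  }

  // The first op index some peer still needs: any segment containing
  // ops at or above this index must be retained.
  int64_t RetentionWatermark() const {
    int64_t watermark = std::numeric_limits<int64_t>::max();
    for (const auto& entry : last_replicated_) {
      watermark = std::min(watermark, entry.second + 1);
    }
    return watermark;
  }

 private:
  std::map<std::string, int64_t> last_replicated_;  // peer uuid -> last acked index
};

// The leader would attach RetentionWatermark() to its heartbeats, and
// each follower would treat it as one more anchor in its GC decision,
// roughly:
//
//   int64_t effective_anchor =
//       std::min(local_anchors.EarliestAnchoredIndex(),
//                watermark_from_leader);
//
// so that neither A nor B (in the example above) GCs segments that the
// laggy replica C still needs in order to catch up.

Hope that helps

-Todd

--
Todd Lipcon
Software Engineer, Cloudera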
