Thanks a bunch Todd for these detailed explanations - some of those unclear Qs 
stemmed from the fact that I didn’t know the difference between resurrecting a 
stale replica vs bringing up a new replica and their relation to WALs and GCs. 
Your descriptions with  with examples provide helpful context to dwell some 
time on code reading. 

Also, I wonder if I could Ctrl+CV some of the below points to an existing 
design doc.

Thanks,

> On Aug 18, 2016, at 11:51 AM, Todd Lipcon <[email protected]> wrote:
> 
> On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <[email protected]> wrote:
> 
>> Hi Todd,
>> 
>> While looking at KUDU-1408, which addresses catch-up-game between
>> bootstrapping a new replica and log retention on existing replicas, few
>> basic Qs popping up in mind:
>> Looking at description of LogAnchor class, what’s the purpose of log
>> anchor generally.
>> 
> 
> We use the term 'anchor' to mean preventing the garbage collection of old
> log segments. The main reasons we anchor logs are currently:
> 
> 1- Durability: if there is some data that was inserted into a memory store
> (MRS or DMS) which has not yet been flushed, we need to retain the log
> segment which has the original operation so that we can replay it
> 
> 2- Time: we respect a configuration --log_min_seconds_to_retain and always
> retain any log segment that was closed within that time bound. This is
> useful so that, if a follower replica crashes or freezes temporarily and
> then comes back within this time period, the leader will still have enough
> back history of logs in order to "catch up" the follower.
> 
> 3- Number of segments: a configuration --log_min_segments_to_retain
> determines how many segments of WAL we keep regardless of the above
> concerns. This is useful for things like debugging, and to prevent some
> edge cases in the code that might happen if there are 0 WALs for a tablet.
> 
> Log Anchors are primarily concerned with item #1 above (durability). They
> allow a memory store to register an "anchor" corresponding to the operation
> index of the first data item that mutated them, and that anchor is kept
> alive until the store is durably flushed. The registry keeps the set of
> live anchors so that it can quickly be interrogated during log GC.
> 
> 
> 
>> Also, is this any different from rejoining a stale replica to the cluster
>> ? (stale could mean going as far back as starting a new replica ?)
>> 
> 
> Not sure exactly what you're asking here... maybe the above answers it? The
> time-based retention's main goal is to keep some "back history" of each
> tablet so that if a replica is only a little bit stale, it can restart and
> catch up from the leader's logs. If the leader has already GCed the
> necessary logs to catch up the replica (eg the replica was down for too
> long) then it will have to be evicted and replaced as a new replica. In
> that case, the new replica will start its life using the 'tablet copy'
> mechanism to transfer a snapshot of one of the existing replicas. The
> snapshot contains on-disk state as well as the WALs necessary to replay
> from that state. You can think of it like an rsync from a good replica.
> 
> 
>> Instead of sending anchor logs via heartbeats(as you seem to be suggesting
>> in the bug), could we not rely on the follower to suggest the leader where
>> he wants to replay the logs from ? In other words, the ’tablet copy' begins
>> with leader asking the follower what was his(or her ) last known index.  I
>> am confident I am missing several pieces here.
>> 
> 
> The thing is that replication is leader-driven, and you may have a leader
> re-election at any point. So, if you have three nodes A, B, and C, with A
> as the leader, and C as a laggy replica, we want to ensure that neither A
> *nor B* GCs the logs that C may need to catch up. Currently there is no
> propagation of the information about how far behind C is back up to the WAL
> system, and both A and B will happily GC logs based on the factors
> mentioned up top in this email. One of the suggestions in KUDU-1408 is to
> (1) propagate info on the lagging replica's index to the leader to prevent
> GC on the leader, and (2) to have the leader propagate that information to
> other followers (like 'B' in this example) so that they also don't GC that
> log segment too early.
> 
> Hope that helps
> -Todd
> -- 
> Todd Lipcon
> Software Engineer, Cloudera

Reply via email to