Thanks a bunch Todd for these detailed explanations - some of those unclear Qs stemmed from the fact that I didn’t know the difference between resurrecting a stale replica vs bringing up a new replica and their relation to WALs and GCs. Your descriptions with with examples provide helpful context to dwell some time on code reading.
Also, I wonder if I could Ctrl+CV some of the below points to an existing design doc. Thanks, > On Aug 18, 2016, at 11:51 AM, Todd Lipcon <[email protected]> wrote: > > On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <[email protected]> wrote: > >> Hi Todd, >> >> While looking at KUDU-1408, which addresses catch-up-game between >> bootstrapping a new replica and log retention on existing replicas, few >> basic Qs popping up in mind: >> Looking at description of LogAnchor class, what’s the purpose of log >> anchor generally. >> > > We use the term 'anchor' to mean preventing the garbage collection of old > log segments. The main reasons we anchor logs are currently: > > 1- Durability: if there is some data that was inserted into a memory store > (MRS or DMS) which has not yet been flushed, we need to retain the log > segment which has the original operation so that we can replay it > > 2- Time: we respect a configuration --log_min_seconds_to_retain and always > retain any log segment that was closed within that time bound. This is > useful so that, if a follower replica crashes or freezes temporarily and > then comes back within this time period, the leader will still have enough > back history of logs in order to "catch up" the follower. > > 3- Number of segments: a configuration --log_min_segments_to_retain > determines how many segments of WAL we keep regardless of the above > concerns. This is useful for things like debugging, and to prevent some > edge cases in the code that might happen if there are 0 WALs for a tablet. > > Log Anchors are primarily concerned with item #1 above (durability). They > allow a memory store to register an "anchor" corresponding to the operation > index of the first data item that mutated them, and that anchor is kept > alive until the store is durably flushed. The registry keeps the set of > live anchors so that it can quickly be interrogated during log GC. > > > >> Also, is this any different from rejoining a stale replica to the cluster >> ? (stale could mean going as far back as starting a new replica ?) >> > > Not sure exactly what you're asking here... maybe the above answers it? The > time-based retention's main goal is to keep some "back history" of each > tablet so that if a replica is only a little bit stale, it can restart and > catch up from the leader's logs. If the leader has already GCed the > necessary logs to catch up the replica (eg the replica was down for too > long) then it will have to be evicted and replaced as a new replica. In > that case, the new replica will start its life using the 'tablet copy' > mechanism to transfer a snapshot of one of the existing replicas. The > snapshot contains on-disk state as well as the WALs necessary to replay > from that state. You can think of it like an rsync from a good replica. > > >> Instead of sending anchor logs via heartbeats(as you seem to be suggesting >> in the bug), could we not rely on the follower to suggest the leader where >> he wants to replay the logs from ? In other words, the ’tablet copy' begins >> with leader asking the follower what was his(or her ) last known index. I >> am confident I am missing several pieces here. >> > > The thing is that replication is leader-driven, and you may have a leader > re-election at any point. So, if you have three nodes A, B, and C, with A > as the leader, and C as a laggy replica, we want to ensure that neither A > *nor B* GCs the logs that C may need to catch up. Currently there is no > propagation of the information about how far behind C is back up to the WAL > system, and both A and B will happily GC logs based on the factors > mentioned up top in this email. One of the suggestions in KUDU-1408 is to > (1) propagate info on the lagging replica's index to the leader to prevent > GC on the leader, and (2) to have the leader propagate that information to > other followers (like 'B' in this example) so that they also don't GC that > log segment too early. > > Hope that helps > -Todd > -- > Todd Lipcon > Software Engineer, Cloudera
