On Thu, Aug 18, 2016 at 7:05 AM, Dinesh Bhat <[email protected]> wrote:
> Hi Todd,
>
> While looking at KUDU-1408, which addresses the catch-up game between
> bootstrapping a new replica and log retention on existing replicas, a few
> basic questions popped up in my mind:
> Looking at the description of the LogAnchor class, what is the purpose of
> a log anchor generally?

We use the term 'anchor' to mean preventing the garbage collection of old log segments. The main reasons we anchor logs are currently:

1- Durability: if there is some data that was inserted into a memory store (MRS or DMS) which has not yet been flushed, we need to retain the log segment which has the original operation so that we can replay it.

2- Time: we respect a configuration --log_min_seconds_to_retain and always retain any log segment that was closed within that time bound. This is useful so that, if a follower replica crashes or freezes temporarily and then comes back within this time period, the leader will still have enough back history of logs to "catch up" the follower.

3- Number of segments: a configuration --log_min_segments_to_retain determines how many segments of WAL we keep regardless of the above concerns. This is useful for things like debugging, and to prevent some edge cases in the code that might occur if there are 0 WALs for a tablet.

Log anchors are primarily concerned with item #1 above (durability). They allow a memory store to register an "anchor" corresponding to the operation index of the first data item that mutated it, and that anchor is kept alive until the store is durably flushed. The registry keeps the set of live anchors so that it can quickly be interrogated during log GC.

> Also, is this any different from rejoining a stale replica to the cluster?
> (stale could mean going as far back as starting a new replica?)

Not sure exactly what you're asking here... maybe the above answers it? The time-based retention's main goal is to keep some "back history" of each tablet so that, if a replica is only a little bit stale, it can restart and catch up from the leader's logs. If the leader has already GCed the logs necessary to catch up the replica (e.g. the replica was down for too long), then it will have to be evicted and replaced by a new replica. In that case, the new replica will start its life using the 'tablet copy' mechanism to transfer a snapshot of one of the existing replicas. The snapshot contains on-disk state as well as the WALs necessary to replay from that state. You can think of it like an rsync from a good replica.

> Instead of sending anchor logs via heartbeats (as you seem to be suggesting
> in the bug), could we not rely on the follower to suggest to the leader
> where it wants to replay the logs from? In other words, the 'tablet copy'
> begins with the leader asking the follower what its last known index was.
> I am confident I am missing several pieces here.

The thing is that replication is leader-driven, and you may have a leader re-election at any point. So, if you have three nodes A, B, and C, with A as the leader and C as a laggy replica, we want to ensure that neither A *nor B* GCs the logs that C may need to catch up. Currently there is no propagation of information about how far behind C is back up to the WAL system, and both A and B will happily GC logs based on the factors mentioned up top in this email.
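To make those retention factors concrete, here's a minimal C++ sketch of how a GC decision might combine them. All the names here (AnchorRegistry, Segment, CountGCableSegments) are made up for illustration and aren't our actual classes; it just shows how the three rules interact:

#include <algorithm>
#include <cstdint>
#include <ctime>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Registry of live anchors: each unflushed memory store registers the
// index of the first operation that mutated it, and unregisters once
// it has been durably flushed.
class AnchorRegistry {
 public:
  void Register(const std::string& owner, int64_t op_index) {
    anchors_[owner] = op_index;
  }
  void Unregister(const std::string& owner) {
    anchors_.erase(owner);
  }
  // Earliest anchored op index, or INT64_MAX if nothing is anchored.
  int64_t EarliestAnchoredIndex() const {
    int64_t min_index = std::numeric_limits<int64_t>::max();
    for (const auto& entry : anchors_) {
      min_index = std::min(min_index, entry.second);
    }
    return min_index;
  }

 private:
  std::map<std::string, int64_t> anchors_;  // owner -> first dirty op index
};

// A closed WAL segment: the highest op index it contains and the time
// at which it was closed.
struct Segment {
  int64_t max_op_index;
  time_t close_time;
};

// Returns how many of the oldest segments may be GCed, applying the
// three retention rules from above: (1) durability anchors,
// (2) --log_min_seconds_to_retain, (3) --log_min_segments_to_retain.
size_t CountGCableSegments(const std::vector<Segment>& segments,  // oldest first
                           const AnchorRegistry& anchors,
                           int min_seconds_to_retain,
                           size_t min_segments_to_retain,
                           time_t now) {
  size_t gcable = 0;
  for (size_t i = 0; i < segments.size(); i++) {
    // Rule 3: always keep at least min_segments_to_retain segments.
    if (segments.size() - i <= min_segments_to_retain) break;
    // Rule 2: keep any segment closed within the retention window.
    if (now - segments[i].close_time < min_seconds_to_retain) break;
    // Rule 1: keep any segment that may still hold an unflushed op.
    if (segments[i].max_op_index >= anchors.EarliestAnchoredIndex()) break;
    gcable++;
  }
  return gcable;
}

The key point for KUDU-1408 is that nothing in this decision knows anything about how far behind a remote follower is.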
One of the suggestions in KUDU-1408 is to (1) propagate information about the lagging replica's last index back to the leader, to prevent GC on the leader, and (2) have the leader propagate that information to the other followers (like 'B' in this example) so that they also don't GC those log segments too early.
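Speculatively, the leader-side bookkeeping for (1) and (2) might look something like the following. None of this is implemented, and the names (LeaderRetentionTracker, RetentionWatermark) are hypothetical; the idea is that the leader tracks how far each peer has replicated, derives a retention watermark from the minimum across peers, applies it to its own GC decision, and piggybacks it on its heartbeats so followers can fold it into theirs:

#include <algorithm>
#include <cstdint>
#include <limits>
#include <map>
#include <string>

// Leader-side bookkeeping: the last op index each peer has replicated.
class LeaderRetentionTracker {
 public:
  // Called when a peer acks replication up to 'index' (or when the
  // leader learns a peer's last-received index from a response).
  void UpdatePeerIndex(const std::string& peer_uuid, int64_t index) {
    last_replicated_[peer_uuid] = index;
  }

  // The first op index some peer still needs: any segment containing
  // ops at or above this index must be retained.
  int64_t RetentionWatermark() const {
    int64_t watermark = std::numeric_limits<int64_t>::max();
    for (const auto& entry : last_replicated_) {
      watermark = std::min(watermark, entry.second + 1);
    }
    return watermark;
  }

 private:
  std::map<std::string, int64_t> last_replicated_;  // peer uuid -> last acked index
};

// The leader would attach RetentionWatermark() to its heartbeats, and
// each follower would treat it as one more anchor in its GC decision,
// roughly:
//
//   int64_t effective_anchor =
//       std::min(local_anchors.EarliestAnchoredIndex(),
//                watermark_from_leader);
//
// so that neither A nor B (in the example above) GCs segments that the
// laggy replica C still needs in order to catch up.

Hope that helps

-Todd

--
Todd Lipcon
Software Engineer, Cloudera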
