[ https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978859#comment-16978859 ]
Adar Dembo commented on KUDU-38:
--------------------------------

I spoke to Todd offline and he helped me see what I had been missing. The key issue is: although bootstrap need only replay the ops needed for durability (i.e. committed but not yet flushed), the running tablet should retain all ops needed to catch up a slow peer. This means the log index must be consistent going back to the oldest such op, and today we ensure that consistency by rebuilding the log index as we replay the log. Meaning:
# Either we replay beginning from {{min(oldest_op_needed_for_durability, oldest_op_needed_for_slow_peers)}} to establish a consistent log index, knowing that this may be overkill in terms of how much we replay, or
# We replay from {{oldest_op_needed_for_durability}} and ensure the log index is consistent going back to {{oldest_op_needed_for_slow_peers}} some other way, e.g. by syncing it when rolling segments (and dealing with backwards compatibility appropriately).

A couple of other things about slow peers:
* At bootstrap time we don't actually know what {{oldest_op_needed_for_slow_peers}} is (it's a purely in-memory construct), but we do know that, prior to crashing, we wouldn't have GC'ed a segment we still needed to catch up a peer. So the full set of ops to replay includes {{oldest_op_needed_for_slow_peers}}, and perhaps a bunch of GC-able ops before it too.
* The ability to serve ops going all the way back to {{oldest_op_needed_for_slow_peers}} is best effort; it isn't strictly necessary for correctness. If we crash, restart, bootstrap, and get it wrong, the worst case is that a minority of replicas end up too far behind, get evicted, and re-replicate. That's still a (temporary) hit to availability and performance, so it's worth preserving as an invariant.

Putting that aside for a moment, observe that the big win from addressing KUDU-38 is on servers hosting thousands of completely cold replicas. For such replicas, {{oldest_op_needed_for_slow_peers}} and {{oldest_op_needed_for_durability}} are identical. So if we adopt solution #1 (from the "either/or" list earlier in this comment; see the first sketch below), we address KUDU-38 for cold replicas in a way that neither compromises the ability to catch up slow peers nor replays more than is necessary. For hot replicas we'll replay more than is necessary, but potentially less than we replay today.

Looking at the solution space, it's tempting to do away with the log index altogether as Todd suggested, but that's a significant undertaking. I'm also leaning against keeping the log index consistent on disk because of the performance hit implied: most users don't set {{log_force_fsync_all}} and thus don't fsync any log structures at all; msyncing the log index would change that. So I'm currently leaning towards storing the "physical index" of {{min(oldest_op_needed_for_durability, oldest_op_needed_for_slow_peers)}} in the TabletMetadata at flush time, which we can obtain via {{Log::GetRetentionIndexes}}. The catch: how do we exclude the anchor belonging to the MRS/DMS currently being flushed? (See the second sketch below.)
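To make solution #1 concrete, here is a minimal C++ sketch of the replay lower bound and the bootstrap-time segment skip. The names here ({{RetentionIndexes}} field names, {{ReplayLowerBound}}, {{ShouldReplaySegment}}) are illustrative stand-ins, not the actual Kudu declarations:

{code}
#include <algorithm>
#include <cstdint>

// Illustrative stand-in for the two retention indexes discussed above;
// the struct and field names are assumptions, not real Kudu APIs.
struct RetentionIndexes {
  int64_t for_durability;  // oldest op needed for durability (committed, unflushed)
  int64_t for_peers;       // oldest op needed to catch up a slow peer
};

// Solution #1: start replay at the min of the two indexes, so the log
// index rebuilt during replay is consistent back to the oldest op a
// slow peer might still need.
int64_t ReplayLowerBound(const RetentionIndexes& retention) {
  return std::min(retention.for_durability, retention.for_peers);
}

// At bootstrap, a segment whose highest replicated index falls below the
// persisted lower bound contains only flushed ops and can be skipped.
bool ShouldReplaySegment(int64_t segment_max_index,
                         int64_t replay_lower_bound) {
  return segment_max_index >= replay_lower_bound;
}
{code}

For cold replicas the two indexes coincide, so this skips every fully-flushed segment; for hot replicas it replays more than durability alone requires, but no more than we replay today.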
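And a hedged sketch of the flush-time "catch": when computing the index to persist in the TabletMetadata, ignore the anchor held by the MRS/DMS being flushed, since the ops it pins are about to become durable. Every type and name here ({{LogAnchor}}, {{being_flushed}}, {{PersistableMinReplayIndex}}) is hypothetical; real Kudu anchors live in the log anchor registry and look nothing like this:

{code}
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical anchor record, for illustration only.
struct LogAnchor {
  int64_t log_index;   // earliest WAL index this anchor pins
  bool being_flushed;  // true for the MRS/DMS flush in progress
};

// Earliest anchored index, excluding the anchor of the in-flight flush.
// This is the value we'd persist in the TabletMetadata at flush time.
int64_t PersistableMinReplayIndex(const std::vector<LogAnchor>& anchors) {
  int64_t min_index = std::numeric_limits<int64_t>::max();
  for (const LogAnchor& a : anchors) {
    if (a.being_flushed) continue;  // the "catch": skip the current flush
    min_index = std::min(min_index, a.log_index);
  }
  return min_index;
}
{code}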
> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
>                 Key: KUDU-38
>                 URL: https://issues.apache.org/jira/browse/KUDU-38
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: tablet
>    Affects Versions: M3
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>              Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments,
> including those that can be trivially determined to contain only durable
> edits. This makes startup unnecessarily slow.