[ https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978859#comment-16978859 ]

Adar Dembo commented on KUDU-38:
--------------------------------

I spoke to Todd offline and he helped me see what I had been missing.

The key issue is: although bootstrap need only replay those ops that are needed 
for durability (i.e. committed but not yet flushed), the running tablet should 
have all ops needed to catch up a slow peer. This means that the log index must 
be consistent going back to the oldest such op, and today, we ensure that 
consistency by rebuilding the log index as we replay the log. Meaning:
# Either we replay beginning from {{min(oldest_op_needed_for_durability, 
oldest_op_needed_for_slow_peers)}} to establish a consistent log index, 
knowing that this may be overkill in terms of how much we replay (see the 
sketch after this list), or
# We replay from {{oldest_op_needed_for_durability}} and ensure the log index 
is consistent going back to {{oldest_op_needed_for_slow_peers}} some other way, 
e.g. by syncing it when rolling segments (and dealing with backwards 
compatibility appropriately).
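
To make the trade-off concrete, here is a minimal sketch of where each option 
would start replaying and how bootstrap could use that starting point to skip 
segments. The types and names below are purely illustrative, not the real 
tablet bootstrap code.
{code}
// Illustrative sketch only; names and types are placeholders, not Kudu's
// actual bootstrap code.
#include <algorithm>
#include <cstdint>
#include <vector>

struct SegmentInfo {
  int64_t min_op_index;  // lowest op index contained in the segment
  int64_t max_op_index;  // highest op index contained in the segment
};

// Option #1: replay from the older of the two retention points, so the log
// index rebuilt during replay is consistent back to the oldest op a slow
// peer might still need. Possibly replays more than durability alone requires.
int64_t ReplayStartOption1(int64_t oldest_for_durability,
                           int64_t oldest_for_slow_peers) {
  return std::min(oldest_for_durability, oldest_for_slow_peers);
}

// Option #2: replay only from the durability point; the log index for older
// ops must then be made consistent some other way (e.g. synced on roll).
int64_t ReplayStartOption2(int64_t oldest_for_durability,
                           int64_t /*oldest_for_slow_peers*/) {
  return oldest_for_durability;
}

// Either way, bootstrap can skip any segment containing no op at or above
// the chosen starting point.
std::vector<SegmentInfo> SegmentsToReplay(
    const std::vector<SegmentInfo>& segments, int64_t replay_start) {
  std::vector<SegmentInfo> to_replay;
  for (const SegmentInfo& seg : segments) {
    if (seg.max_op_index >= replay_start) {
      to_replay.push_back(seg);
    }
  }
  return to_replay;
}
{code}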

A couple of other things about slow peers:
* At bootstrap time we don't actually know what 
{{oldest_op_needed_for_slow_peers}} is (it's a purely in-memory construct), but 
we know that, prior to crashing, we wouldn't have GC'ed a segment that was 
still needed to catch up a peer. So the full set of ops to replay includes 
{{oldest_op_needed_for_slow_peers}}, and perhaps a bunch of GC-able ops before 
it too.
* The ability to serve ops going all the way back to 
{{oldest_op_needed_for_slow_peers}} is best effort. It isn't strictly necessary 
for correctness. If we crash, restart, bootstrap, and get it wrong, the worst 
case scenario is that a minority of replicas will be too far behind, will get 
evicted, and will be rereplicated. Nevertheless, it's a (temporary) hit to 
availability and performance, so it's worth preserving as an invariant.

Putting this aside for a moment, we can observe that the big win of addressing 
KUDU-38 is with servers hosting thousands of completely cold replicas. In such 
replicas, {{oldest_op_needed_for_slow_peers}} and 
{{oldest_op_needed_for_durability}} are identical. So if we adopt solution #1 
(from the "either/or" list earlier in this comment), we address KUDU-38 for 
cold replicas in a way that doesn't compromise the ability to catch up slow 
peers and doesn't replay more than is necessary. For hot replicas, we'll 
replay more than is necessary, but potentially less than what we replay today.

Looking at the solution space, it's tempting to do away with the log index 
altogether, as Todd suggested, but that's a significant undertaking. I'm also 
leaning against keeping the log index consistent by syncing it (solution #2) 
because of the performance hit implied there: most users don't set 
{{log_force_fsync_all}} and thus don't fsync any log structures at all; 
msyncing the log index changes that. So I'm currently leaning towards storing 
the "physical index" of {{min(oldest_op_needed_for_durability, 
oldest_op_needed_for_slow_peers)}} in the TabletMetadata at flush time, which 
we can obtain using {{Log::GetRetentionIndexes}}. The catch is: how to exclude 
the anchor belonging to the current MRS/DMS being flushed?
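
For illustration, here's a minimal sketch of what that flush-time bookkeeping 
might look like. Everything below is hypothetical: the {{RetentionIndexes}} 
fields, {{SetMinReplayableLogIndex}}, and the fake metadata class are 
placeholders for this comment, not Kudu's real APIs.
{code}
// Hypothetical sketch of the proposed flush-time bookkeeping; names are
// placeholders, not Kudu's real APIs.
#include <algorithm>
#include <cstdint>
#include <iostream>

struct RetentionIndexes {
  int64_t for_durability;  // oldest op not yet flushed
  int64_t for_peers;       // oldest op a slow peer might still need
};

// Stand-in for TabletMetadata: persists the log index below which bootstrap
// never needs to replay.
struct FakeTabletMetadata {
  int64_t min_replayable_log_index = 0;
  void SetMinReplayableLogIndex(int64_t idx) { min_replayable_log_index = idx; }
  void Flush() { /* would durably rewrite the metadata here */ }
};

// Called at flush time, once the MRS/DMS data is durable on disk. The open
// question from above still applies: the retention indexes passed in must
// exclude the anchor of the MRS/DMS that was just flushed, otherwise the
// stored index lags behind by one flush.
void RecordReplayLowWaterMark(const RetentionIndexes& retention,
                              FakeTabletMetadata* meta) {
  int64_t min_index = std::min(retention.for_durability, retention.for_peers);
  meta->SetMinReplayableLogIndex(min_index);
  meta->Flush();
}

int main() {
  FakeTabletMetadata meta;
  // Example: everything below op 101 is flushed; the slowest peer still
  // needs ops from 80 onwards.
  RecordReplayLowWaterMark({/*for_durability=*/101, /*for_peers=*/80}, &meta);
  // Bootstrap could then skip any segment whose ops are all below index 80.
  std::cout << meta.min_replayable_log_index << std::endl;
  return 0;
}
{code}
At bootstrap, the stored value would bound the replay starting point from 
below, in the spirit of option #1 above.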

> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
>                 Key: KUDU-38
>                 URL: https://issues.apache.org/jira/browse/KUDU-38
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: tablet
>    Affects Versions: M3
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>              Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments, 
> including those that can be trivially determined to contain only durable 
> edits. This makes startup unnecessarily slow.


