[
https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978859#comment-16978859
]
Adar Dembo commented on KUDU-38:
--------------------------------
I spoke to Todd offline and he helped me see what I had been missing.
The key issue is this: although bootstrap need only replay the ops required
for durability (i.e. committed but not yet flushed), the running tablet should
retain all ops needed to catch up a slow peer. This means the log index must
be consistent going back to the oldest such op, and today we ensure that
consistency by rebuilding the log index as we replay the log. That leaves two options:
# Either we replay beginning from {{min(oldest_op_needed_for_durability,
oldest_op_needed_for_slow_peers)}} to establish a consistent log index,
knowing that this may be overkill in terms of how much we replay, or
# We replay from {{oldest_op_needed_for_durability}} and ensure the log index
is consistent going back to {{oldest_op_needed_for_slow_peers}} some other way,
e.g. by syncing it when rolling segments (and dealing with backwards
compatibility appropriately).
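To make solution #1 concrete, here's a minimal C++ sketch. The struct mirrors
the two anchors that {{Log::GetRetentionIndexes}} reports (one for durability,
one for lagging peers); the names are illustrative, not Kudu's actual
bootstrap code:
{code:cpp}
#include <algorithm>
#include <cstdint>

struct RetentionIndexes {
  int64_t for_durability;  // oldest op committed but not yet flushed
  int64_t for_peers;       // oldest op a lagging follower may still need
};

// Replay from the older (smaller) of the two anchors, so the rebuilt log
// index is consistent all the way back to anything a slow peer might request.
int64_t ReplayStartIndex(const RetentionIndexes& retention) {
  return std::min(retention.for_durability, retention.for_peers);
}
{code}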
A couple of other things to note about slow peers:
* At bootstrap time we don't actually know what
{{oldest_op_needed_for_slow_peers}} is (it's a purely in-memory construct), but
we do know that, prior to crashing, we would only have GC'ed a segment that
was no longer needed to catch up a peer. So the full set of ops available to
replay includes {{oldest_op_needed_for_slow_peers}}, and perhaps a bunch of
GC-able ops before it too (see the sketch after this list).
* The ability to serve ops going all the way back to
{{oldest_op_needed_for_slow_peers}} is best effort; it isn't strictly necessary
for correctness. If we crash, restart, bootstrap, and get it wrong, the worst
case scenario is that a minority of replicas will be too far behind, will get
evicted, and will be re-replicated. Nevertheless, that's a (temporary) hit to
availability and performance, so it's worth preserving as an invariant.
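Here's the sketch referenced in the first bullet: a hypothetical segment-GC
check (illustrative, not Kudu's actual GC code) showing why whatever is on
disk after a crash is a superset of what slow peers need:
{code:cpp}
// A closed segment may only be deleted once every op it contains precedes
// both retention anchors. Consequently, any segment holding ops at or after
// oldest_op_needed_for_slow_peers survives a crash, possibly along with
// GC-able segments that simply hadn't been collected yet.
#include <algorithm>
#include <cstdint>

struct RetentionIndexes {
  int64_t for_durability;  // oldest op committed but not yet flushed
  int64_t for_peers;       // oldest op a lagging follower may still need
};

// True iff the segment whose highest op index is 'last_op_in_segment'
// is safe to garbage-collect.
bool SegmentIsGCable(int64_t last_op_in_segment,
                     const RetentionIndexes& retention) {
  return last_op_in_segment <
         std::min(retention.for_durability, retention.for_peers);
}
{code}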
Putting this aside for a moment, we can observe that the big win in addressing
KUDU-38 is on servers hosting thousands of completely cold replicas. For such
replicas, {{oldest_op_needed_for_slow_peers}} and
{{oldest_op_needed_for_durability}} are identical. So if we adopted solution #1
(from the "either/or" list earlier in this comment), we'd address KUDU-38 for
cold replicas in a way that doesn't compromise the ability to catch up slow
peers and doesn't replay more than is necessary. For hot replicas we'd replay
more than is strictly necessary, but potentially less than what we replay today.
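To put made-up numbers on that: a cold replica has everything flushed and no
lagging followers, so both anchors sit at, say, op 5000, and solution #1
replays from min(5000, 5000) = 5000, i.e. essentially nothing. A hot replica
might have an unflushed MRS anchoring op 4800 and a lagging follower anchored
at op 4200; solution #1 then replays from min(4800, 4200) = 4200, which is
more than durability alone requires, but bounded by what we deliberately
retained rather than by every segment still on disk.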
Looking at the solution space, it's tempting to do away with the log index
altogether as Todd suggested, but that's a significant undertaking. I'm also
leaning against keeping the log index durably consistent (solution #2) because
of the performance hit it implies: most users don't set
{{log_force_fsync_all}} and thus don't fsync any log structures at all;
msyncing the log index would change that. So I'm currently leaning towards
storing the "physical index" of
{{min(oldest_op_needed_for_durability, oldest_op_needed_for_slow_peers)}} in
the TabletMetadata at flush time, which we can obtain via
{{Log::GetRetentionIndexes}}. The catch: how do we exclude the anchor
belonging to the MRS/DMS currently being flushed?
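To sketch where I'm heading (all names here are hypothetical, not committed
Kudu API: TabletMetadata doesn't currently have such a field, and the strawman
at the end is precisely the open question above):
{code:cpp}
#include <algorithm>
#include <cstdint>

struct RetentionIndexes {
  int64_t for_durability;  // oldest op committed but not yet flushed
  int64_t for_peers;       // oldest op a lagging follower may still need
};

class TabletMetadata {
 public:
  // Hypothetical persistent field that bootstrap would consult to decide
  // where to begin replay.
  void set_log_replay_start_index(int64_t index) { replay_start_ = index; }
 private:
  int64_t replay_start_ = 0;
};

// Called while flushing an MRS/DMS whose own durability anchor is
// 'flushing_anchor'. Once the flush completes that anchor no longer pins
// the log, so ideally we persist the min computed without it.
void PersistReplayStart(const RetentionIndexes& retention,
                        int64_t flushing_anchor,
                        TabletMetadata* meta) {
  int64_t durability = retention.for_durability;
  if (durability == flushing_anchor) {
    // Strawman exclusion: skip past the in-flight anchor. This is only
    // safe if no other durability anchor shares the same index; making
    // this robust is the unresolved question.
    durability = flushing_anchor + 1;
  }
  meta->set_log_replay_start_index(
      std::min(durability, retention.for_peers));
}
{code}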
> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
> Key: KUDU-38
> URL: https://issues.apache.org/jira/browse/KUDU-38
> Project: Kudu
> Issue Type: Sub-task
> Components: tablet
> Affects Versions: M3
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Major
> Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments,
> including those that can be trivially determined to contain only durable
> edits. This makes startup unnecessarily slow.