[
https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977775#comment-16977775
]
Todd Lipcon commented on KUDU-38:
---------------------------------
bq. Guaranteeing that every complete segment has a fully sync'ed index file
makes for a nice invariant, but isn't it overkill for the task at hand?
Couldn't we get away with sync'ing whichever index file contains the earliest
anchored index at TabletMetadata flush time? I'm particularly concerned about
the backwards compatibility implications: how do we establish this invariant
after upgrading to a release including this fix? Or, how do we detect that it's
not present in existing log index files?
I think we need to make sure that all prior indexes are also synced, because
it's possible that there is a lagging peer that will still need to catch up
from a very old record. The index file is what allows a leader to go find those
old log entries and send them along. Without it, the old log segments aren't
useful.
bq. I'm particularly concerned about the backwards compatibility implications:
how do we establish this invariant after upgrading to a release including this
fix? Or, how do we detect that it's not present in existing log index files?
Yep, we'd need to take that into account, eg by adding some new flag to the
tablet metadata indicating that the indexes are durable or somesuch.
bq. Alternatively, what about forgoing the log index file and rather than
storing the earliest anchored index in the TabletMetadata, storing the
"physical index" (i.e. the LogIndexEntry corresponding to the anchor)?
Again, per the above, we can't run with an invalid index file, or else
consensus won't be happy.
----
Ignoring the rest of your questions for a minute, let me throw out an
alternative idea or two:
*Option 1:*
We could add a new, separate piece of metadata next to the logs called a "sync
point" or somesuch (this could even live at a well-known offset in the existing
log file). We could periodically wake up a background task for a log (eg when
we see that the sync point has fallen too far behind) and then:
(1) look up the earliest durability-anchored offset
(2) msync the log indexes up to that point
(3) write that point to the special "sync point" metadata file. This is just an
offset, so it can be written atomically and lazily flushed (it only moves
forward)
At startup, if we see a sync point metadata file, we know we can start
replaying (and reconstructing index) from that point, without having to
reconstruct any earlier index entries. If we do this lazily (eg once every few
seconds only on actively-written tablets) the performance overhead should be
negligible.
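As a rough illustration of steps (1)-(3) and the startup check, something like
the following could work (all file names and helpers here are hypothetical
stand-ins for the sketch, not actual Kudu code):

```python
import os

def advance_sync_point(wal_dir, earliest_anchored_offset, msync_indexes_up_to):
    """Background task: durably record that everything before
    'earliest_anchored_offset' has a synced index."""
    # (2) msync the log index chunks covering offsets up to the anchor
    #     (passed in as a callable here; a no-op stand-in works for testing).
    msync_indexes_up_to(earliest_anchored_offset)
    # (3) write the sync point atomically: temp file + fsync + rename.
    #     The value only moves forward, so a stale value after a crash is
    #     merely conservative, never incorrect.
    path = os.path.join(wal_dir, "sync_point")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(earliest_anchored_offset))
        f.flush()
        os.fsync(f.fileno())
    os.rename(tmp, path)

def read_sync_point(wal_dir):
    """Startup: begin replay (and index reconstruction) from the sync
    point; if none exists, replay from the beginning."""
    try:
        with open(os.path.join(wal_dir, "sync_point")) as f:
            return int(f.read())
    except FileNotFoundError:
        return 0
```

Step (1), looking up the earliest durability-anchored offset, would come from
the existing log anchor registry and is elided here.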
We also need to think about how this interacts with tablet copy -- right now, a
newly copied tablet relies on replaying the WALs from the beginning because it
doesn't copy log indexes. We may need to change that.
*Option 2*: get rid of "log index"
This is the "nuke everything from orbit" option: the whole log index thing was
convenient, but it's annoying for a number of reasons: (1) the issues described
here; (2) we are using mmapped IO, which is dangerous since IO errors crash the
process; (3) it's just another bit of code to worry about and transfer around
in tablet copy; etc.
The alternative is to embed the index in the WAL itself. One sketch of an
implementation would be something like:
- divide the WAL into fixed-size pages, each with a header. The header would
have term/index info and some kind of "continuation" flag for when entries span
multiple pages. This is more or less the postgres WAL design
- this allows us to binary-search the WAL instead of having a separate index.
- we have to consider how truncations work -- I guess we would move to physical
truncation.
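A toy model of the binary search over page headers (the header fields and
their meanings are assumptions for illustration, not Kudu's actual format):

```python
import bisect
from dataclasses import dataclass

@dataclass
class PageHeader:
    first_index: int    # index of the first log entry that starts on this page
    continuation: bool  # page begins with the tail of an entry from a prior page

def find_page_for_index(headers, target_index):
    """Binary-search the page headers for the page on which the entry with
    'target_index' starts: the rightmost page whose first_index <= target."""
    firsts = [h.first_index for h in headers]
    pos = bisect.bisect_right(firsts, target_index) - 1
    return max(pos, 0)
```

In the real file there would be no materialized header list: each probe of the
search would seek to `k * PAGE_SIZE` and read just that page's header, which is
exactly what fixed-size pages buy us.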
Another possible idea would be to not use fixed-size pages, but instead embed a
tree structure into the WAL itself. For example, it wouldn't be too tough to
add a back-pointer from each entry to the previous entry to enable backward
scanning. If we then take a skip-list-like approach (n/2 nodes have a skip-2
pointer, n/4 nodes have a skip-4 pointer, n/8 nodes have a skip-8 pointer, etc)
then we can get logarithmic access time to past log entries. Again, we'd need
to consider truncation.
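To make the skip-pointer idea concrete (the pointer layout here is one
possible scheme for the sketch, not a settled design): entry i could store a
back-pointer to i-2^k for each power of two 2^k dividing i, and a backward
seek greedily takes the largest jump that doesn't overshoot the target:

```python
def skip_pointers(i):
    """Back-pointers stored with entry i: always i-1, plus i-2^k for each
    power of two 2^k that divides i."""
    ptrs = [i - 1]
    step = 2
    while i % step == 0:
        ptrs.append(i - step)
        step *= 2
    return ptrs

def seek_back(start, target):
    """Seek from entry 'start' back to entry 'target' (target <= start).
    Returns (entry reached, number of hops); hops grow logarithmically
    in the distance covered."""
    cur, hops = start, 0
    while cur > target:
        # The smallest pointer value still >= target is the farthest safe jump.
        cur = min(p for p in skip_pointers(cur) if p >= target)
        hops += 1
    return cur, hops
```

The greedy choice of the farthest non-overshooting pointer is what bounds the
hop count, by the same argument as for ordinary skip lists.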
Either of these options has the advantage that we no longer need to worry
about indexes, but we still do need to worry about figuring out where to start
replaying from, and we could take the same strategy as the first suggestion for
that.
> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
> Key: KUDU-38
> URL: https://issues.apache.org/jira/browse/KUDU-38
> Project: Kudu
> Issue Type: Sub-task
> Components: tablet
> Affects Versions: M3
> Reporter: Todd Lipcon
> Assignee: Todd Lipcon
> Priority: Major
> Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments,
> including those that can be trivially determined to contain only durable
> edits. This makes startup unnecessarily slow.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)