[ https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977775#comment-16977775 ]
Todd Lipcon commented on KUDU-38:
---------------------------------

bq. Guaranteeing that every complete segment has a fully sync'ed index file makes for a nice invariant, but isn't it overkill for the task at hand? Couldn't we get away with sync'ing whichever index file contains the earliest anchored index at TabletMetadata flush time? I'm particularly concerned about the backwards compatibility implications: how do we establish this invariant after upgrading to a release including this fix? Or, how do we detect that it's not present in existing log index files?

I think we need to make sure that all prior indexes are also synced, because it's possible that there is a lagging peer that will still need to catch up from a very old record. The index file is what allows a leader to go find those old log entries and send them along. Without it, the old log segments aren't useful.

bq. I'm particularly concerned about the backwards compatibility implications: how do we establish this invariant after upgrading to a release including this fix? Or, how do we detect that it's not present in existing log index files?

Yep, we'd need to take that into account, e.g. by adding some new flag to the tablet metadata indicating that the indexes are durable, or somesuch.

bq. Alternatively, what about forgoing the log index file and, rather than storing the earliest anchored index in the TabletMetadata, storing the "physical index" (i.e. the LogIndexEntry corresponding to the anchor)?

Again, per the above, we can't be left with an invalid index file, or else consensus won't be happy.

----

Ignoring the rest of your questions for a minute, let me throw out an alternative idea or two:

*Option 1:* We could add a new, separate piece of metadata next to the logs called a "sync point" or somesuch (this could even live at a well-known offset in the existing log file).
We can periodically wake up a background process (e.g. when we see that the sync point is too far back) and then:
# look up the earliest durability-anchored offset
# msync the log indexes up to that point
# write that point to the special "sync point" metadata file

This is just an offset, so it can be written atomically and flushed lazily (it only moves forward). At startup, if we see a sync point metadata file, we know we can start replaying (and reconstructing the index) from that point, without having to reconstruct any earlier index entries. If we do this lazily (e.g. once every few seconds, and only on actively-written tablets) the performance overhead should be negligible.

We also need to think about how this interacts with tablet copy -- right now, a newly copied tablet relies on replaying the WALs from the beginning because it doesn't copy log indexes. We may need to change that.

*Option 2*: get rid of the "log index"

This is the "nuke everything from orbit" option. The whole log index thing was convenient, but it's somewhat annoying for a number of reasons: (1) the issues described here, (2) we are using mmapped IO, which is dangerous since IO errors crash the process, (3) it's just another bit of code to worry about and transfer around in tablet copy, etc.

The alternative is to embed the index in the WAL itself. One sketch of an implementation would be something like:
- divide the WAL into fixed-size pages, each with a header. The header would have term/index info and some kind of "continuation" flag for when entries span multiple pages. This is more or less the postgres WAL design.
- this allows us to binary-search the WAL instead of keeping a separate index.
- we have to consider how truncations work -- I guess we would move to physical truncation.

Another possible idea would be to not use fixed-size pages, but instead embed a tree structure into the WAL itself.
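Circling back to Option 1 for a moment: a rough sketch of the sync-point mechanics might look like the following. This is not Kudu code -- the file name, the index-file lookup callback, and the source of the anchored offset are all made up for illustration; the point is just that the sync point is a single monotonically-advancing offset, so a fixed-size write plus fsync is enough.

```python
# Hypothetical sketch of the "sync point" idea, assuming a WAL directory
# and some external way to learn the earliest durability-anchored offset.
import os
import struct

SYNC_POINT_FILE = "wal-sync-point"  # illustrative name, not a Kudu file


def write_sync_point(wal_dir, offset):
    """Persist the sync point as a single 8-byte little-endian value.
    The offset only moves forward, so a stale value is safe: replay just
    starts earlier than strictly necessary."""
    path = os.path.join(wal_dir, SYNC_POINT_FILE)
    with open(path, "wb") as f:
        f.write(struct.pack("<Q", offset))
        f.flush()
        os.fsync(f.fileno())


def read_sync_point(wal_dir):
    """Return the saved sync point, or 0 if none exists (e.g. a tablet
    created before the upgrade): replay then starts from the beginning,
    exactly as today."""
    path = os.path.join(wal_dir, SYNC_POINT_FILE)
    try:
        with open(path, "rb") as f:
            return struct.unpack("<Q", f.read(8))[0]
    except FileNotFoundError:
        return 0


def maybe_advance_sync_point(wal_dir, earliest_anchored_offset,
                             index_files_up_to):
    """One background-process wakeup: sync every index file covering
    entries before the anchor, then record the new sync point.
    `index_files_up_to` is a caller-supplied lookup (hypothetical)."""
    if earliest_anchored_offset <= read_sync_point(wal_dir):
        return  # sync point is already current; nothing to do
    for index_path in index_files_up_to(earliest_anchored_offset):
        with open(index_path, "rb") as f:
            os.fsync(f.fileno())  # stand-in for msync'ing the mmapped index
    write_sync_point(wal_dir, earliest_anchored_offset)
```

The ordering is the important part: the indexes must be durable before the sync point advances past them, since the sync point is a promise that nothing before it needs to be reconstructed at startup.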
For example, to embed such a tree structure, it wouldn't be too tough to add a back-pointer from each entry to the previous entry to enable backward scanning. If we then take a skip-list-like approach (n/2 entries have a skip-2 pointer, n/4 have a skip-4 pointer, n/8 have a skip-8 pointer, etc.) then we get logarithmic access time to past log entries. Again, we'd need to consider truncation.

Either of these options has the advantage that we no longer need to worry about indexes, but we still need to figure out where to start replaying from, and we could take the same strategy as the first suggestion for that.

> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
>                 Key: KUDU-38
>                 URL: https://issues.apache.org/jira/browse/KUDU-38
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: tablet
>    Affects Versions: M3
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>              Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments, including those that can be trivially determined to contain only durable edits. This makes startup unnecessarily slow.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)