[ https://issues.apache.org/jira/browse/KUDU-38?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977775#comment-16977775 ]

Todd Lipcon commented on KUDU-38:
---------------------------------

bq. Guaranteeing that every complete segment has a fully sync'ed index file 
makes for a nice invariant, but isn't it overkill for the task at hand? 
Couldn't we get away with sync'ing whichever index file contains the earliest 
anchored index at TabletMetadata flush time? I'm particularly concerned about 
the backwards compatibility implications: how do we establish this invariant 
after upgrading to a release including this fix? Or, how do we detect that it's 
not present in existing log index files?

I think we need to make sure that all prior indexes are also synced, because 
it's possible that there is a lagging peer that will still need to catch up 
from a very old record. The index file is what allows a leader to go find those 
old log entries and send them along. Without it, the old log segments aren't 
useful.

bq. I'm particularly concerned about the backwards compatibility implications: 
how do we establish this invariant after upgrading to a release including this 
fix? Or, how do we detect that it's not present in existing log index files?

Yep, we'd need to take that into account, e.g. by adding a new flag to the 
tablet metadata indicating that the indexes are durable.

bq. Alternatively, what about forgoing the log index file and rather than 
storing the earliest anchored index in the TabletMetadata, storing the 
"physical index" (i.e. the LogIndexEntry corresponding to the anchor)?

Again, per the above, we can't operate with an invalid index file, or else 
consensus won't be happy.

----
Ignoring the rest of your questions for a minute, let me throw out an 
alternative idea or two:

*Option 1:*

We could add a new, separate piece of metadata next to the logs called a "sync 
point" or some such (this could even live at a well-known offset in the 
existing log file). We could periodically wake up a background task per log 
(e.g. when we see that the sync point has fallen too far behind) and then:
(1) look up the earliest durability-anchored offset
(2) msync the log indexes up to that point
(3) write that point to the special "sync point" metadata file. Since this is 
just an offset, it can be written atomically and flushed lazily (it only ever 
moves forward)

At startup, if we see a sync point metadata file, we know we can start 
replaying (and reconstructing the index) from that point, without having to 
reconstruct any earlier index entries. If we do this lazily (e.g. once every 
few seconds, and only on actively-written tablets), the performance overhead 
should be negligible.
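The steps above could be sketched roughly as follows (all names here are 
illustrative, not actual Kudu APIs; a real version would msync the mmapped 
log-index chunks and atomically overwrite an offset in a small metadata file):

```cpp
#include <cstdint>

// Hypothetical model of the "sync point" advancement loop.
struct SyncPoint {
  int64_t synced_up_to = 0;   // in-memory copy of the durable sync point
  int64_t last_msynced = 0;   // stand-in for step (2): index synced up to here
  int64_t persisted = 0;      // stand-in for step (3): offset written durably

  // Periodic background step: advance the sync point to the earliest
  // durability-anchored offset, making the index durable first.
  void MaybeAdvance(int64_t earliest_anchored_offset) {
    if (earliest_anchored_offset <= synced_up_to) return;  // already current
    last_msynced = earliest_anchored_offset;  // (2) msync indexes up to anchor
    persisted = earliest_anchored_offset;     // (3) then publish the new point
    synced_up_to = earliest_anchored_offset;  // only ever moves forward
  }
};
```

Ordering matters here: the index must be durable before the sync point is 
published, so a crash between the two steps only means a slightly stale sync 
point, never a sync point that claims more than is actually on disk.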

We also need to think about how this interacts with tablet copy -- right now, a 
newly copied tablet relies on replaying the WALs from the beginning because it 
doesn't copy log indexes. We may need to change that.

*Option 2*: get rid of "log index"
This is the "nuke everything from orbit" option: the whole log index thing was 
convenient, but it's somewhat annoying for a number of reasons: (1) the issues 
described here, (2) we are using mmapped IO, which is dangerous since IO errors 
crash the process, (3) it's just another bit of code to worry about and 
transfer around in tablet copy, etc.

The alternative is to embed the index in the WAL itself. One sketch of an 
implementation would be something like:
- divide the WAL into fixed-size pages, each with a header. The header would 
have term/index info and some kind of "continuation" flag for when entries 
span multiple pages. This is more or less the PostgreSQL WAL design
- this allows us to binary-search the WAL instead of having a separate index.
- we have to consider how truncations work -- I guess we would move to 
physical truncation.
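As a rough illustration of the binary search, assuming each page header 
records the OpId index of the first entry that starts in that page (the names 
and layout here are hypothetical, and continuation pages are ignored for 
simplicity):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical fixed-size WAL page header.
struct PageHeader {
  int64_t term;
  int64_t first_index;  // OpId index of the first entry starting in this page
  bool continuation;    // true if the page begins mid-entry
};

// Binary-search the page headers for the page in which `target` begins.
// Returns the page number, or -1 if `target` precedes the log.
int64_t FindPageForIndex(const std::vector<PageHeader>& pages, int64_t target) {
  int64_t lo = 0, hi = static_cast<int64_t>(pages.size()) - 1, ans = -1;
  while (lo <= hi) {
    int64_t mid = lo + (hi - lo) / 2;
    if (pages[mid].first_index <= target) {
      ans = mid;       // candidate page; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return ans;
}
```

Since page headers sit at fixed offsets, each probe is a single seek-and-read, 
with no separate index file to keep in sync.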

Another possible idea would be to forgo fixed-size pages and instead embed a 
tree structure in the WAL itself. For example, it wouldn't be too tough to 
add a back-pointer from each entry to the previous entry to enable backward 
scanning. If we then take a skip-list-like approach (n/2 nodes have a skip-2 
pointer, n/4 nodes have a skip-4 pointer, n/8 nodes have a skip-8 pointer, 
etc.), we get logarithmic access time to past log entries. Again, we'd need 
to consider truncation.
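One deterministic way to assign those skip pointers (a sketch under assumed 
naming, not a worked-out design) is to key the skip distance off the entry's 
sequence number, giving each entry pointers for every power of two up to the 
largest power of two dividing its number:

```cpp
#include <cstdint>

// Largest power of two dividing n: entry n carries back-pointers that skip
// 1, 2, ..., up to this distance. Half the entries top out at skip-2,
// a quarter at skip-4, an eighth at skip-8, and so on.
int64_t MaxSkipDistance(int64_t n) {
  return n & (-n);
}

// Walk backward from entry `from` to entry `to` (to < from), counting hops.
// Each hop takes the largest available skip that does not overshoot.
int64_t HopsBackward(int64_t from, int64_t to) {
  int64_t hops = 0;
  while (from > to) {
    int64_t skip = MaxSkipDistance(from);
    while (from - skip < to) skip /= 2;  // fall back to a shorter pointer
    from -= skip;
    ++hops;
  }
  return hops;
}
```

The per-entry overhead is small (on average two extra pointers per entry), and 
the walk is O(log n) hops, which is what makes random access to old entries 
cheap without any side file.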

Either of these options has the advantage that we no longer need to worry 
about index files, but we still need to figure out where to start replaying 
from, and we could take the same strategy as in the first suggestion for that.

> bootstrap should not replay logs that are known to be fully flushed
> -------------------------------------------------------------------
>
>                 Key: KUDU-38
>                 URL: https://issues.apache.org/jira/browse/KUDU-38
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: tablet
>    Affects Versions: M3
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Major
>              Labels: data-scalability, startup-time
>
> Currently the bootstrap process will process all of the log segments, 
> including those that can be trivially determined to contain only durable 
> edits. This makes startup unnecessarily slow.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
