[ https://issues.apache.org/jira/browse/CASSANDRA-13987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277025#comment-16277025 ]
Sam Tunnicliffe commented on CASSANDRA-13987: --------------------------------------------- bq. I've added CommitLogChainedMarkersTest. Looking at it now, perhaps it could do with a better name and/or a comment at the top of the file explaining what, specifically, it's testing. Thanks for adding the test, I definitely agree that an explanatory comment would be useful, took me a few minutes to figure out what was going on there :) Few more minor-ish comments: * The validation on {{markerIntervalMillis}} in the latest commit on 3.0/3.11 also needs adding to trunk. Without it, use of the default value results in wakeup time always being before {{now}} * In ACLS, Can you rename {{requestSync}} to something like {{syncRequested}} or {{explicitSyncReqested}} to be consistent with {{shutdownRequested}}? * In trunk, the comment near the top of {{CLS::sync}} is missing a word, I'm guessing it should read "this will evaluate to true *when* the allocate position is jacked...". Also, this comment is missing on 3.0/3.11 * Also in {{CLS::sync}}, why does the calculation of {{needToMarkData}} include {{SYNC_MARKER_SIZE}}? I could be missing something, but it seems to me that we only need check {{allocatePosition.get() > lastMarkerOffset}}. * On 3.0, in ACLS's runnable shouldn't this be checking {{shutdown}} not {{run}} because we want to flush if shutdown *has* been requested? {code}if (lastSyncedAt + syncIntervalMillis <= pollStarted || run || requestSync){code} The dtests seem to be pretty broken on all 3 branches, is that teething troubles with the custom circleci yaml, or are you still working on those? > Multithreaded commitlog subtly changed durability > ------------------------------------------------- > > Key: CASSANDRA-13987 > URL: https://issues.apache.org/jira/browse/CASSANDRA-13987 > Project: Cassandra > Issue Type: Improvement > Reporter: Jason Brown > Assignee: Jason Brown > Fix For: 4.x > > > When multithreaded commitlog was introduced in CASSANDRA-3578, we subtly > changed the way that commitlog durability worked. Everything still gets > written to an mmap file. However, not everything is replayable from the > mmaped file after a process crash, in periodic mode. > In brief, the reason this changesd is due to the chained markers that are > required for the multithreaded commit log. At each msync, we wait for > outstanding mutations to serialize into the commitlog, and update a marker > before and after the commits that have accumluated since the last sync. With > those markers, we can safely replay that section of the commitlog. Without > the markers, we have no guarantee that the commits in that section were > successfully written, thus we abandon those commits on replay. > If you have correlated process failures of multiple nodes at "nearly" the > same time (see ["There Is No > Now"|http://queue.acm.org/detail.cfm?id=2745385]), it is possible to have > data loss if none of the nodes msync the commitlog. For example, with RF=3, > if quorum write succeeds on two nodes (and we acknowledge the write back to > the client), and then the process on both nodes OOMs (say, due to reading the > index for a 100GB partition), the write will be lost if neither process > msync'ed the commitlog. More exactly, the commitlog cannot be fully replayed. > The reason why this data is silently lost is due to the chained markers that > were introduced with CASSANDRA-3578. > The problem we are addressing with this ticket is incrementally improving > 'durability' due to process crash, not host crash. (Note: operators should > use batch mode to ensure greater durability, but batch mode in it's current > implementation is a) borked, and b) will burn through, *very* rapidly, SSDs > that don't have a non-volatile write cache sitting in front.) > The current default for {{commitlog_sync_period_in_ms}} is 10 seconds, which > means that a node could lose up to ten seconds of data due to process crash. > The unfortunate thing is that the data is still avaialble, in the mmap file, > but we can't replay it due to incomplete chained markers. > ftr, I don't believe we've ever had a stated policy about commitlog > durability wrt process crash. Pre-2.0 we naturally piggy-backed off the > memory mapped file and the fact that every mutation was acquired a lock and > wrote into the mmap buffer, and the ability to replay everything out of it > came for free. With CASSANDRA-3578, that was subtly changed. > Something [~jjirsa] pointed out to me is that [MySQL provides a way to adjust > the durability > guarantees|https://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_flush_log_at_trx_commit] > of each commit in innodb via the {{innodb_flush_log_at_trx_commit}}. I'm > using that idea as a loose springboard for what to do here. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org For additional commands, e-mail: commits-h...@cassandra.apache.org