I investigated this further today and I believe this has to do with concurrent client flushes interleaving, resulting in out-of-order writes on the server side.
More details in https://issues.apache.org/jira/browse/KUDU-1761 David / Alexey, based on your knowledge of exactly-once semantics and the client flush path does my explanation make sense? I'll dig into it further but based on my hazy memory I think we allow 2 outstanding flushes at a time and do not enforce ordering with our exactly-once mechanism, only idempotence of individual operations (IIRC we do not do flow control if a client sequence number is skipped). Mike On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy <[email protected]> wrote: > Alexey says that KuduSession::Flush() already blocks on all outstanding > async flushes, and apparently that's true, so either there is a non-obvious > bug there or it's caused by something else. > > Mike > > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy <[email protected]> wrote: > >> Hi Alexey, >> Thank you very much for investigating this test and providing an easy >> repro case. I just got around to spending some time on this due to David's >> read-your-writes (RYW) patches adding to this test. I think we should try >> to make this test reliable to be sure RYW doesn't have anomalies we aren't >> aware of. I applied your (second) patch mentioned earlier in this thread >> and was able to quickly reproduce the issues you described. >> >> I suspect that the problem stems from the fact that in the C++ client, >> session->Flush() is a no-op if an async flush is already in-progress. At >> least, it looks to me like those are the current semantics. If I remove the >> session->FlushAsync(null) from your patch then the test passes for me. >> >> The problem with a non-blocking Flush() while an async flush is >> outstanding that is that tablet_history_gc-itest relies on the following >> series of operations in order to ensure it knows what the consistent set of >> values should be for a particular timestamp: >> >> FlushSessionOrDie(session); >> } >> SaveSnapshot(std::move(snapshot), clock_->Now()); >> >> So it waits until the data is flushed to the server and then captures the >> current server timestamp as a key for its record of expected data values at >> that timestamp. The above appears in all of the write code paths in the >> test. If FlushSessionOrDie() is an async no-op, and the client has not >> truly flushed outstanding data at the current time stamp, then the >> "SaveSnapshot()" code path will think the latest writes are included in the >> snapshot for the current time, but in actuality the old value will be >> stored for that timestamp on the server. >> >> This matches with the observed behavior based on running the (patched) >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following >> in the log: >> >> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving >> snapshot at ts = P: 1000000000 usec, L: 1 (4096000000001) >> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting >> round 1 >> ... >> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating row >> to { 511, 1535296252, 1535296251, NOT_DELETED } >> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating row >> to { 511, 1708088659, 1708088658, NOT_DELETED } >> ... >> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving >> snapshot at ts = P: 1000000000 usec, L: 514 (4096000000514) >> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated 512 >> rows >> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting >> round 2 >> ... >> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26: >> Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559 >> (4096000001559) >> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242: Failure >> Value of: int_val >> Actual: 1535296252 >> Expected: snap_iter->second.int_val >> Which is: 1708088659 >> at row key 511 >> >> So I think we have a couple options: >> >> 1. Make KuduSession::Flush() "timeline-consistent" (block until all >> outstanding flushes complete), or >> 2. Only use manual flushing in this test​ >> >> Personally, I think the current observed semantics of >> KuduSession::Flush() are not ideal and I'd advocate that we do #1. >> >> Mike >> >> >
