Thanks for clarifying, Alexey. Alexey / David, I was able to reproduce the problem with a more minimized test case, which I just posted to Gerrit: http://gerrit.cloudera.org:8080/5256
I also filed a JIRA ticket to cover this issue independent of the test breaking or not: https://issues.apache.org/jira/browse/KUDU-1767 Off the top of my head, I can think of a couple of ways we can address this problem: 1. Default the KuduSession::SetMutationBufferMaxNum() to 1 - will cause a perf hit and not exactly foolproof, but low effort. 2. Implement well-ordering on the server such that if a client has an existing session with a server then the server will prepare requests from a single client session in-order. This is less trivial to do. Whatever we do, we should try to do it for all affected clients. Thoughts on the above? In my opinion this is a serious problem for the read-your-writes effort and leaves AUTO_BACKGROUND_FLUSH feeling pretty broken. Best, Mike On Mon, Nov 28, 2016 at 7:25 PM, Alexey Serbin <aser...@cloudera.com> wrote: > Mike, > > Thank you for clarifying on this and opening the JIRA item which tracks > the issue. > > I'm not an expert on exactly-once semantics, but it seems David provided > more information on KUDU-1761. > > Indeed, when using the auto background flush the client library does not > provide a means to guarantee the order of operation completion on the > server side, unless the number of batches is limited to 1 (by default it's > 2). From the client perspective, the client sends the operations > sequentially, but due to the background flushing they might be sent to the > server in parallel and completing at the server side even in reverse > order. So, if the exactly-once semantics does not play here, we should > address this in a different way. > > > Best regards, > > Alexey > > On Thu, Nov 24, 2016 at 7:32 AM, Mike Percy <mpe...@apache.org> wrote: > >> I investigated this further today and I believe this has to do with >> concurrent client flushes interleaving, resulting in out-of-order writes >> on >> the server side. >> >> More details in https://issues.apache.org/jira/browse/KUDU-1761 >> >> David / Alexey, based on your knowledge of exactly-once semantics and the >> client flush path does my explanation make sense? I'll dig into it further >> but based on my hazy memory I think we allow 2 outstanding flushes at a >> time and do not enforce ordering with our exactly-once mechanism, only >> idempotence of individual operations (IIRC we do not do flow control if a >> client sequence number is skipped). >> >> Mike >> >> On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy <mpe...@apache.org> wrote: >> >> > Alexey says that KuduSession::Flush() already blocks on all outstanding >> > async flushes, and apparently that's true, so either there is a >> non-obvious >> > bug there or it's caused by something else. >> > >> > Mike >> > >> > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy <mpe...@apache.org> wrote: >> > >> >> Hi Alexey, >> >> Thank you very much for investigating this test and providing an easy >> >> repro case. I just got around to spending some time on this due to >> David's >> >> read-your-writes (RYW) patches adding to this test. I think we should >> try >> >> to make this test reliable to be sure RYW doesn't have anomalies we >> aren't >> >> aware of. I applied your (second) patch mentioned earlier in this >> thread >> >> and was able to quickly reproduce the issues you described. >> >> >> >> I suspect that the problem stems from the fact that in the C++ client, >> >> session->Flush() is a no-op if an async flush is already in-progress. >> At >> >> least, it looks to me like those are the current semantics. If I >> remove the >> >> session->FlushAsync(null) from your patch then the test passes for me. >> >> >> >> The problem with a non-blocking Flush() while an async flush is >> >> outstanding that is that tablet_history_gc-itest relies on the >> following >> >> series of operations in order to ensure it knows what the consistent >> set of >> >> values should be for a particular timestamp: >> >> >> >> FlushSessionOrDie(session); >> >> } >> >> SaveSnapshot(std::move(snapshot), clock_->Now()); >> >> >> >> So it waits until the data is flushed to the server and then captures >> the >> >> current server timestamp as a key for its record of expected data >> values at >> >> that timestamp. The above appears in all of the write code paths in the >> >> test. If FlushSessionOrDie() is an async no-op, and the client has not >> >> truly flushed outstanding data at the current time stamp, then the >> >> "SaveSnapshot()" code path will think the latest writes are included >> in the >> >> snapshot for the current time, but in actuality the old value will be >> >> stored for that timestamp on the server. >> >> >> >> This matches with the observed behavior based on running the (patched) >> >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following >> >> in the log: >> >> >> >> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving >> >> snapshot at ts = P: 1000000000 usec, L: 1 (4096000000001) >> >> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting >> >> round 1 >> >> ... >> >> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating >> row >> >> to { 511, 1535296252, 1535296251, NOT_DELETED } >> >> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating >> row >> >> to { 511, 1708088659, 1708088658, NOT_DELETED } >> >> ... >> >> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving >> >> snapshot at ts = P: 1000000000 usec, L: 514 (4096000000514) >> >> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated >> 512 >> >> rows >> >> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting >> >> round 2 >> >> ... >> >> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26: >> >> Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559 >> >> (4096000001559) >> >> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242: >> Failure >> >> Value of: int_val >> >> Actual: 1535296252 >> >> Expected: snap_iter->second.int_val >> >> Which is: 1708088659 >> >> at row key 511 >> >> >> >> So I think we have a couple options: >> >> >> >> 1. Make KuduSession::Flush() "timeline-consistent" (block until all >> >> outstanding flushes complete), or >> >> 2. Only use manual flushing in this test​ >> >> >> >> Personally, I think the current observed semantics of >> >> KuduSession::Flush() are not ideal and I'd advocate that we do #1. >> >> >> >> Mike >> >> >> >> >> > >> > >