Mike,

Thank you for clarifying this and opening the JIRA item that tracks the
issue.

I'm not an expert on exactly-once semantics, but it seems David provided
more information on KUDU-1761.

Indeed, when using auto background flush, the client library does not
provide a way to guarantee the order in which operations complete on the
server side, unless the number of batches is limited to 1 (by default it's
2).  The client sends the operations sequentially, but due to the
background flushing they might be sent to the server in parallel and
complete on the server side even in reverse order.  So, if exactly-once
semantics does not come into play here, we should address this in a
different way.
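
For reference, here is a rough sketch (untested; the table and column names
are made up for illustration) of what I mean by limiting the number of
batches: with a single mutation buffer, a new batch cannot be sent while the
previous one is still in flight, so operations applied to the same row
cannot reach the server out of order.

    #include <kudu/client/client.h>

    using kudu::client::KuduSession;
    using kudu::client::KuduTable;
    using kudu::client::KuduUpsert;
    using kudu::Status;

    Status WriteInOrder(
        const kudu::client::sp::shared_ptr<KuduSession>& session,
        KuduTable* table) {
      KUDU_RETURN_NOT_OK(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));
      // The default is 2 mutation buffers, which allows two batches to be
      // in flight at once; with 1, the background flushes are serialized.
      KUDU_RETURN_NOT_OK(session->SetMutationBufferMaxNum(1));
      for (int i = 0; i < 100; ++i) {
        KuduUpsert* op = table->NewUpsert();
        KUDU_RETURN_NOT_OK(op->mutable_row()->SetInt32("key", 511));
        KUDU_RETURN_NOT_OK(op->mutable_row()->SetInt32("int_val", i));
        KUDU_RETURN_NOT_OK(session->Apply(op));  // takes ownership of 'op'
      }
      return session->Flush();  // waits for everything applied above
    }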


Best regards,

Alexey

On Thu, Nov 24, 2016 at 7:32 AM, Mike Percy <[email protected]> wrote:

> I investigated this further today and I believe this has to do with
> concurrent client flushes interleaving, resulting in out-of-order writes on
> the server side.
>
> More details in https://issues.apache.org/jira/browse/KUDU-1761
>
> David / Alexey, based on your knowledge of exactly-once semantics and the
> client flush path, does my explanation make sense? I'll dig into it further
> but based on my hazy memory I think we allow 2 outstanding flushes at a
> time and do not enforce ordering with our exactly-once mechanism, only
> idempotence of individual operations (IIRC we do not do flow control if a
> client sequence number is skipped).
>
> Mike
>
> On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy <[email protected]> wrote:
>
> > Alexey says that KuduSession::Flush() already blocks on all outstanding
> > async flushes, and apparently that's true, so either there is a
> > non-obvious bug there or it's caused by something else.
> >
> > Mike
> >
> > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy <[email protected]> wrote:
> >
> >> Hi Alexey,
> >> Thank you very much for investigating this test and providing an easy
> >> repro case. I just got around to spending some time on this due to
> >> David's read-your-writes (RYW) patches adding to this test. I think we
> >> should try to make this test reliable to be sure RYW doesn't have
> >> anomalies we aren't aware of. I applied your (second) patch mentioned
> >> earlier in this thread and was able to quickly reproduce the issues you
> >> described.
> >>
> >> I suspect that the problem stems from the fact that in the C++ client,
> >> session->Flush() is a no-op if an async flush is already in-progress. At
> >> least, it looks to me like those are the current semantics. If I remove
> >> the session->FlushAsync(null) from your patch then the test passes for
> >> me.
> >>
> >> The problem with a non-blocking Flush() while an async flush is
> >> outstanding is that tablet_history_gc-itest relies on the following
> >> series of operations in order to ensure it knows what the consistent
> >> set of values should be for a particular timestamp:
> >>
> >>           FlushSessionOrDie(session);
> >>         }
> >>         SaveSnapshot(std::move(snapshot), clock_->Now());
> >>
> >> So it waits until the data is flushed to the server and then captures
> >> the current server timestamp as a key for its record of expected data
> >> values at that timestamp. The above appears in all of the write code
> >> paths in the test. If FlushSessionOrDie() is an async no-op, and the
> >> client has not truly flushed outstanding data at the current timestamp,
> >> then the "SaveSnapshot()" code path will think the latest writes are
> >> included in the snapshot for the current time, but in actuality the old
> >> value will be stored for that timestamp on the server.
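> >>
> >> To make the hazard concrete, the interleaving I have in mind is roughly
> >> the following (just a sketch against the public C++ client API, not the
> >> actual test code; the row key and values are taken from the log below):
> >>
> >>     #include <cstdint>
> >>     #include <kudu/client/client.h>
> >>
> >>     using kudu::client::KuduSession;
> >>     using kudu::client::KuduTable;
> >>     using kudu::client::KuduUpdate;
> >>     using kudu::Status;
> >>
> >>     // Apply one update of 'int_val' for row key 511 to the session.
> >>     Status UpdateRow(
> >>         const kudu::client::sp::shared_ptr<KuduSession>& session,
> >>         KuduTable* table, int32_t val) {
> >>       KuduUpdate* op = table->NewUpdate();
> >>       KUDU_RETURN_NOT_OK(op->mutable_row()->SetInt32("key", 511));
> >>       KUDU_RETURN_NOT_OK(op->mutable_row()->SetInt32("int_val", val));
> >>       return session->Apply(op);  // takes ownership of 'op'
> >>     }
> >>
> >>     Status UpdateTwiceAndFlush(
> >>         const kudu::client::sp::shared_ptr<KuduSession>& session,
> >>         KuduTable* table) {
> >>       KUDU_RETURN_NOT_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));
> >>       KUDU_RETURN_NOT_OK(UpdateRow(session, table, 1535296252));
> >>       session->FlushAsync(nullptr);  // batch with the first update starts flushing
> >>       KUDU_RETURN_NOT_OK(UpdateRow(session, table, 1708088659));
> >>       // If Flush() does not wait for the FlushAsync() above, a timestamp
> >>       // captured right after this call (as SaveSnapshot() does) can still
> >>       // miss the second update on the server.
> >>       return session->Flush();
> >>     }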
> >>
> >> This matches with the observed behavior based on running the (patched)
> >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following
> >> in the log:
> >>
> >> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving
> >> snapshot at ts = P: 1000000000 usec, L: 1 (4096000000001)
> >> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting
> >> round 1
> >> ...
> >> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating
> >> row to { 511, 1535296252, 1535296251, NOT_DELETED }
> >> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating
> >> row to { 511, 1708088659, 1708088658, NOT_DELETED }
> >> ...
> >> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving
> >> snapshot at ts = P: 1000000000 usec, L: 514 (4096000000514)
> >> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated 512
> >> rows
> >> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting
> >> round 2
> >> ...
> >> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26:
> >> Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559
> >> (4096000001559)
> >> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242: Failure
> >> Value of: int_val
> >>   Actual: 1535296252
> >> Expected: snap_iter->second.int_val
> >> Which is: 1708088659
> >> at row key 511
> >>
> >> So I think we have a couple options:
> >>
> >>    1. Make KuduSession::Flush() "timeline-consistent" (block until all
> >>    outstanding flushes complete), or
> >>    2. Only use manual flushing in this test
> >>
> >> Personally, I think the current observed semantics of
> >> KuduSession::Flush() are not ideal and I'd advocate that we do #1.
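> >>
> >> For reference, option 2 would amount to keeping the session in manual
> >> flush mode in all of the test's write paths, roughly (sketch only, same
> >> includes and usings as the sketch above):
> >>
> >>     Status WriteRound(
> >>         const kudu::client::sp::shared_ptr<KuduSession>& session) {
> >>       KUDU_RETURN_NOT_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));
> >>       // ... Apply() the writes for this round ...
> >>       return session->Flush();  // everything applied above has reached
> >>                                 // the server once this returns OK
> >>     }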
> >>
> >> Mike
> >>
> >>
> >
>
