Thanks for clarifying, Alexey.

Alexey / David, I was able to reproduce the problem with a more minimized
test case, which I just posted to Gerrit:
http://gerrit.cloudera.org:8080/5256

I also filed a JIRA ticket to cover this issue independent of the test
breaking or not: https://issues.apache.org/jira/browse/KUDU-1767

Off the top of my head, I can think of a couple of ways we can address this
problem:

1. Default the KuduSession::SetMutationBufferMaxNum() to 1 - will cause a
perf hit and not exactly foolproof, but low effort.
2. Implement well-ordering on the server such that if a client has an
existing session with a server then the server will prepare requests from a
single client session in-order. This is less trivial to do.

Whatever we do, we should try to do it for all affected clients.

Thoughts on the above? In my opinion this is a serious problem for the
read-your-writes effort and leaves AUTO_BACKGROUND_FLUSH feeling pretty
broken.

Best,
Mike

On Mon, Nov 28, 2016 at 7:25 PM, Alexey Serbin <aser...@cloudera.com> wrote:

> Mike,
>
> Thank you for clarifying on this and opening the JIRA item which tracks
> the issue.
>
> I'm not an expert on exactly-once semantics, but it seems David provided
> more information on KUDU-1761.
>
> Indeed, when using the auto background flush the client library does not
> provide a means to guarantee the order of operation completion on the
> server side, unless the number of batches is limited to 1 (by default it's
> 2).  From the client perspective, the client sends the operations
> sequentially, but due to the background flushing they might be sent to the
> server in parallel and completing at the server side even in reverse
> order.  So, if the exactly-once semantics does not play here, we should
> address this in a different way.
>
>
> Best regards,
>
> Alexey
>
> On Thu, Nov 24, 2016 at 7:32 AM, Mike Percy <mpe...@apache.org> wrote:
>
>> I investigated this further today and I believe this has to do with
>> concurrent client flushes interleaving, resulting in out-of-order writes
>> on
>> the server side.
>>
>> More details in https://issues.apache.org/jira/browse/KUDU-1761
>>
>> David / Alexey, based on your knowledge of exactly-once semantics and the
>> client flush path does my explanation make sense? I'll dig into it further
>> but based on my hazy memory I think we allow 2 outstanding flushes at a
>> time and do not enforce ordering with our exactly-once mechanism, only
>> idempotence of individual operations (IIRC we do not do flow control if a
>> client sequence number is skipped).
>>
>> Mike
>>
>> On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy <mpe...@apache.org> wrote:
>>
>> > Alexey says that KuduSession::Flush() already blocks on all outstanding
>> > async flushes, and apparently that's true, so either there is a
>> non-obvious
>> > bug there or it's caused by something else.
>> >
>> > Mike
>> >
>> > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy <mpe...@apache.org> wrote:
>> >
>> >> Hi Alexey,
>> >> Thank you very much for investigating this test and providing an easy
>> >> repro case. I just got around to spending some time on this due to
>> David's
>> >> read-your-writes (RYW) patches adding to this test. I think we should
>> try
>> >> to make this test reliable to be sure RYW doesn't have anomalies we
>> aren't
>> >> aware of. I applied your (second) patch mentioned earlier in this
>> thread
>> >> and was able to quickly reproduce the issues you described.
>> >>
>> >> I suspect that the problem stems from the fact that in the C++ client,
>> >> session->Flush() is a no-op if an async flush is already in-progress.
>> At
>> >> least, it looks to me like those are the current semantics. If I
>> remove the
>> >> session->FlushAsync(null) from your patch then the test passes for me.
>> >>
>> >> The problem with a non-blocking Flush() while an async flush is
>> >> outstanding that is that tablet_history_gc-itest relies on the
>> following
>> >> series of operations in order to ensure it knows what the consistent
>> set of
>> >> values should be for a particular timestamp:
>> >>
>> >>           FlushSessionOrDie(session);
>> >>         }
>> >>         SaveSnapshot(std::move(snapshot), clock_->Now());
>> >>
>> >> So it waits until the data is flushed to the server and then captures
>> the
>> >> current server timestamp as a key for its record of expected data
>> values at
>> >> that timestamp. The above appears in all of the write code paths in the
>> >> test. If FlushSessionOrDie() is an async no-op, and the client has not
>> >> truly flushed outstanding data at the current time stamp, then the
>> >> "SaveSnapshot()" code path will think the latest writes are included
>> in the
>> >> snapshot for the current time, but in actuality the old value will be
>> >> stored for that timestamp on the server.
>> >>
>> >> This matches with the observed behavior based on running the (patched)
>> >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following
>> >> in the log:
>> >>
>> >> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving
>> >> snapshot at ts = P: 1000000000 usec, L: 1 (4096000000001)
>> >> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting
>> >> round 1
>> >> ...
>> >> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating
>> row
>> >> to { 511, 1535296252, 1535296251, NOT_DELETED }
>> >> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating
>> row
>> >> to { 511, 1708088659, 1708088658, NOT_DELETED }
>> >> ...
>> >> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving
>> >> snapshot at ts = P: 1000000000 usec, L: 514 (4096000000514)
>> >> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated
>> 512
>> >> rows
>> >> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting
>> >> round 2
>> >> ...
>> >> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26:
>> >> Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559
>> >> (4096000001559)
>> >> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242:
>> Failure
>> >> Value of: int_val
>> >>   Actual: 1535296252
>> >> Expected: snap_iter->second.int_val
>> >> Which is: 1708088659
>> >> at row key 511
>> >>
>> >> So I think we have a couple options:
>> >>
>> >>    1. Make KuduSession::Flush() "timeline-consistent" (block until all
>> >>    outstanding flushes complete), or
>> >>    2. Only use manual flushing in this test​
>> >>
>> >> Personally, I think the current observed semantics of
>> >> KuduSession::Flush() are not ideal and I'd advocate that we do #1.
>> >>
>> >> Mike
>> >>
>> >>
>> >
>>
>
>

Reply via email to