I investigated this further today and I believe this has to do with
concurrent client flushes interleaving, resulting in out-of-order writes on
the server side.

More details in https://issues.apache.org/jira/browse/KUDU-1761

David / Alexey, based on your knowledge of exactly-once semantics and the
client flush path does my explanation make sense? I'll dig into it further
but based on my hazy memory I think we allow 2 outstanding flushes at a
time and do not enforce ordering with our exactly-once mechanism, only
idempotence of individual operations (IIRC we do not do flow control if a
client sequence number is skipped).

Mike

On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy <[email protected]> wrote:

> Alexey says that KuduSession::Flush() already blocks on all outstanding
> async flushes, and apparently that's true, so either there is a non-obvious
> bug there or it's caused by something else.
>
> Mike
>
> On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy <[email protected]> wrote:
>
>> Hi Alexey,
>> Thank you very much for investigating this test and providing an easy
>> repro case. I just got around to spending some time on this due to David's
>> read-your-writes (RYW) patches adding to this test. I think we should try
>> to make this test reliable to be sure RYW doesn't have anomalies we aren't
>> aware of. I applied your (second) patch mentioned earlier in this thread
>> and was able to quickly reproduce the issues you described.
>>
>> I suspect that the problem stems from the fact that in the C++ client,
>> session->Flush() is a no-op if an async flush is already in-progress. At
>> least, it looks to me like those are the current semantics. If I remove the
>> session->FlushAsync(null) from your patch then the test passes for me.
>>
>> The problem with a non-blocking Flush() while an async flush is
>> outstanding that is that tablet_history_gc-itest relies on the following
>> series of operations in order to ensure it knows what the consistent set of
>> values should be for a particular timestamp:
>>
>>           FlushSessionOrDie(session);
>>         }
>>         SaveSnapshot(std::move(snapshot), clock_->Now());
>>
>> So it waits until the data is flushed to the server and then captures the
>> current server timestamp as a key for its record of expected data values at
>> that timestamp. The above appears in all of the write code paths in the
>> test. If FlushSessionOrDie() is an async no-op, and the client has not
>> truly flushed outstanding data at the current time stamp, then the
>> "SaveSnapshot()" code path will think the latest writes are included in the
>> snapshot for the current time, but in actuality the old value will be
>> stored for that timestamp on the server.
>>
>> This matches with the observed behavior based on running the (patched)
>> test with --vmodule=tablet_history_gc-itest=2 and seeing the following
>> in the log:
>>
>> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving
>> snapshot at ts = P: 1000000000 usec, L: 1 (4096000000001)
>> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting
>> round 1
>> ...
>> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating row
>> to { 511, 1535296252, 1535296251, NOT_DELETED }
>> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating row
>> to { 511, 1708088659, 1708088658, NOT_DELETED }
>> ...
>> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving
>> snapshot at ts = P: 1000000000 usec, L: 514 (4096000000514)
>> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated 512
>> rows
>> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting
>> round 2
>> ...
>> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26:
>> Verifying snapshot scan for timestamp P: 1000000000 usec, L: 1559
>> (4096000001559)
>> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242: Failure
>> Value of: int_val
>>   Actual: 1535296252
>> Expected: snap_iter->second.int_val
>> Which is: 1708088659
>> at row key 511
>>
>> So I think we have a couple options:
>>
>>    1. Make KuduSession::Flush() "timeline-consistent" (block until all
>>    outstanding flushes complete), or
>>    2. Only use manual flushing in this test​
>>
>> Personally, I think the current observed semantics of
>> KuduSession::Flush() are not ideal and I'd advocate that we do #1.
>>
>> Mike
>>
>>
>

Reply via email to