Re: Flaky tablet_history_gc-itest

2016-11-29 Thread Mike Percy
Thanks for clarifying, Alexey.

Alexey / David, I was able to reproduce the problem with a more minimized
test case, which I just posted to Gerrit:
http://gerrit.cloudera.org:8080/5256

I also filed a JIRA ticket to track this issue independently of whether the
test breaks: https://issues.apache.org/jira/browse/KUDU-1767

Off the top of my head, I can think of a couple of ways we can address this
problem:

1. Default KuduSession::SetMutationBufferMaxNum() to 1. This will cause a
perf hit and is not exactly foolproof, but it is low effort (see the sketch
below).
2. Implement well-ordering on the server such that, if a client has an
existing session with a server, the server will prepare requests from a
single client session in order. This is less trivial to do.
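
For illustration, here is a minimal sketch of what option 1 amounts to on the
application side with the C++ client. The table and column names are
hypothetical and the CheckOk() helper exists only for the sketch; this is not
code from Kudu itself:

#include <cstdlib>
#include <iostream>
#include <kudu/client/client.h>

using kudu::client::KuduClient;
using kudu::client::KuduInsert;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

// Tiny helper for the sketch; real code should propagate errors instead.
static void CheckOk(const kudu::Status& s) {
  if (!s.ok()) {
    std::cerr << s.ToString() << std::endl;
    std::abort();
  }
}

void WriteWithSingleMutationBuffer(const shared_ptr<KuduClient>& client) {
  shared_ptr<KuduTable> table;
  CheckOk(client->OpenTable("example_table", &table));  // hypothetical table

  shared_ptr<KuduSession> session = client->NewSession();
  CheckOk(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));
  // With a single mutation buffer, a new batch cannot start flushing while
  // another batch is still in flight, so batches reach the server in the
  // order they were applied.
  CheckOk(session->SetMutationBufferMaxNum(1));

  for (int i = 0; i < 1000; ++i) {
    KuduInsert* insert = table->NewInsert();
    CheckOk(insert->mutable_row()->SetInt32("key", i));
    CheckOk(session->Apply(insert));  // may kick off a background flush
  }
  CheckOk(session->Flush());  // block until everything has reached the server
}

The trade-off is the perf hit mentioned above: while the single buffer is
being flushed, Apply() has to wait instead of accumulating operations into a
second buffer.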

Whatever we do, we should try to do it for all affected clients.

Thoughts on the above? In my opinion this is a serious problem for the
read-your-writes effort and leaves AUTO_FLUSH_BACKGROUND feeling pretty
broken.

Best,
Mike

On Mon, Nov 28, 2016 at 7:25 PM, Alexey Serbin  wrote:

> Mike,
>
> Thank you for clarifying this and opening the JIRA item that tracks the
> issue.
>
> I'm not an expert on exactly-once semantics, but it seems David provided
> more information on KUDU-1761.
>
> Indeed, when using the auto background flush the client library does not
> provide a means to guarantee the order of operation completion on the
> server side, unless the number of batches is limited to 1 (by default it's
> 2).  From the client's perspective the operations are sent sequentially, but
> due to the background flushing they might be sent to the server in parallel
> and complete on the server side even in reverse order.  So, if the
> exactly-once semantics do not come into play here, we should address this in
> a different way.
>
>
> Best regards,
>
> Alexey
>
> On Thu, Nov 24, 2016 at 7:32 AM, Mike Percy  wrote:
>
>> I investigated this further today and I believe this has to do with
>> concurrent client flushes interleaving, resulting in out-of-order writes
>> on
>> the server side.
>>
>> More details in https://issues.apache.org/jira/browse/KUDU-1761
>>
>> David / Alexey, based on your knowledge of exactly-once semantics and the
>> client flush path, does my explanation make sense? I'll dig into it further
>> but based on my hazy memory I think we allow 2 outstanding flushes at a
>> time and do not enforce ordering with our exactly-once mechanism, only
>> idempotence of individual operations (IIRC we do not do flow control if a
>> client sequence number is skipped).
>>
>> Mike
>>
>> On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy  wrote:
>>
>> > Alexey says that KuduSession::Flush() already blocks on all outstanding
>> > async flushes, and apparently that's true, so either there is a
>> non-obvious
>> > bug there or it's caused by something else.
>> >
>> > Mike
>> >
>> > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy  wrote:
>> >
>> >> Hi Alexey,
>> >> Thank you very much for investigating this test and providing an easy
>> >> repro case. I just got around to spending some time on this due to
>> David's
>> >> read-your-writes (RYW) patches adding to this test. I think we should
>> try
>> >> to make this test reliable to be sure RYW doesn't have anomalies we
>> aren't
>> >> aware of. I applied your (second) patch mentioned earlier in this
>> thread
>> >> and was able to quickly reproduce the issues you described.
>> >>
>> >> I suspect that the problem stems from the fact that in the C++ client,
>> session->Flush() is a no-op if an async flush is already in progress.
>> At
>> >> least, it looks to me like those are the current semantics. If I
>> remove the
>> >> session->FlushAsync(null) from your patch then the test passes for me.
>> >>
>> >> The problem with a non-blocking Flush() while an async flush is
>> >> outstanding is that tablet_history_gc-itest relies on the
>> following
>> >> series of operations in order to ensure it knows what the consistent
>> set of
>> >> values should be for a particular timestamp:
>> >>
>> >>   FlushSessionOrDie(session);
>> >> }
>> >> SaveSnapshot(std::move(snapshot), clock_->Now());
>> >>
>> >> So it waits until the data is flushed to the server and then captures
>> the
>> >> current server timestamp as a key for its record of expected data
>> values at
>> >> that timestamp. The above appears in all of the write code paths in the
>> >> test. If FlushSessionOrDie() is an async no-op, and the client has not
>> >> truly flushed outstanding data at the current timestamp, then the
>> >> "SaveSnapshot()" code path will think the latest writes are included
>> in the
>> >> snapshot for the current time, but in actuality the old value will be
>> >> stored for that timestamp on the server.
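
To make the invariant concrete, here is a hedged sketch (not the actual test
code; FlushSessionOrDie() and SaveSnapshot() are the test's own helpers) of
the semantics the test depends on: a timestamp may only be associated with
the expected row values once the session truly has nothing in flight.

#include <cassert>
#include <kudu/client/client.h>

using kudu::client::KuduSession;
using kudu::client::sp::shared_ptr;

void FlushThenMarkTimestamp(const shared_ptr<KuduSession>& session) {
  // If Flush() quietly returns while an earlier async flush is still
  // outstanding, the timestamp captured afterwards can predate some of the
  // applied writes, and the recorded "expected" values will be wrong.
  kudu::Status s = session->Flush();
  assert(s.ok());
  assert(!session->HasPendingOperations());
  // ... only now capture the current timestamp (clock_->Now() in the test)
  // and record the expected row values for a snapshot scan at that time ...
}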
>> >>
>> >> This matches with the observed behavior based on running the (patched)
>> >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following

Re: Flaky tablet_history_gc-itest

2016-11-28 Thread Alexey Serbin
Mike,

Thank you for clarifying this and opening the JIRA item that tracks the
issue.

I'm not an expert on exactly-once semantics, but it seems David provided
more information on KUDU-1761.

Indeed, when using the auto background flush the client library does not
provide a means to guarantee the order of operation completion on the
server side, unless the number of batches is limited to 1 (by default it's
2).  From the client's perspective the operations are sent sequentially, but
due to the background flushing they might be sent to the server in parallel
and complete on the server side even in reverse order.  So, if the
exactly-once semantics do not come into play here, we should address this in
a different way.
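
To make the scenario concrete, here is a hedged sketch of the kind of
interleaving being described, assuming a hypothetical table with int32 "key"
and "val" columns (the CheckOk() helper is only for the sketch; this is an
illustration, not code from the test or the client library):

#include <cstdlib>
#include <kudu/client/client.h>

using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpdate;
using kudu::client::sp::shared_ptr;

static void CheckOk(const kudu::Status& s) {
  if (!s.ok()) std::abort();  // sketch only; real code should handle errors
}

void SequentialUpdatesMayRace(const shared_ptr<KuduClient>& client) {
  shared_ptr<KuduTable> table;
  CheckOk(client->OpenTable("example_table", &table));

  shared_ptr<KuduSession> session = client->NewSession();
  CheckOk(session->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND));
  // By default the session has two mutation buffers, so two batches can be
  // in flight at the same time.

  KuduUpdate* first = table->NewUpdate();
  CheckOk(first->mutable_row()->SetInt32("key", 1));
  CheckOk(first->mutable_row()->SetInt32("val", 100));
  CheckOk(session->Apply(first));
  session->FlushAsync(nullptr);     // buffer #1 starts flushing in background

  KuduUpdate* second = table->NewUpdate();
  CheckOk(second->mutable_row()->SetInt32("key", 1));
  CheckOk(second->mutable_row()->SetInt32("val", 200));
  CheckOk(session->Apply(second));  // lands in buffer #2
  CheckOk(session->Flush());        // buffer #2 may reach the server first

  // The client applied val=100 before val=200, but the two batches travel in
  // separate RPCs and the server does not order them, so the row can end up
  // with either value.
}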


Best regards,

Alexey

On Thu, Nov 24, 2016 at 7:32 AM, Mike Percy  wrote:

> I investigated this further today and I believe this has to do with
> concurrent client flushes interleaving, resulting in out-of-order writes on
> the server side.
>
> More details in https://issues.apache.org/jira/browse/KUDU-1761
>
> David / Alexey, based on your knowledge of exactly-once semantics and the
> client flush path, does my explanation make sense? I'll dig into it further
> but based on my hazy memory I think we allow 2 outstanding flushes at a
> time and do not enforce ordering with our exactly-once mechanism, only
> idempotence of individual operations (IIRC we do not do flow control if a
> client sequence number is skipped).
>
> Mike
>
> On Tue, Nov 22, 2016 at 5:10 PM, Mike Percy  wrote:
>
> > Alexey says that KuduSession::Flush() already blocks on all outstanding
> > async flushes, and apparently that's true, so either there is a
> non-obvious
> > bug there or it's caused by something else.
> >
> > Mike
> >
> > On Tue, Nov 22, 2016 at 4:46 PM, Mike Percy  wrote:
> >
> >> Hi Alexey,
> >> Thank you very much for investigating this test and providing an easy
> >> repro case. I just got around to spending some time on this due to
> David's
> >> read-your-writes (RYW) patches adding to this test. I think we should
> try
> >> to make this test reliable to be sure RYW doesn't have anomalies we
> aren't
> >> aware of. I applied your (second) patch mentioned earlier in this thread
> >> and was able to quickly reproduce the issues you described.
> >>
> >> I suspect that the problem stems from the fact that in the C++ client,
> >> session->Flush() is a no-op if an async flush is already in progress. At
> >> least, it looks to me like those are the current semantics. If I remove
> the
> >> session->FlushAsync(null) from your patch then the test passes for me.
> >>
> >> The problem with a non-blocking Flush() while an async flush is
> >> outstanding is that tablet_history_gc-itest relies on the following
> >> series of operations in order to ensure it knows what the consistent
> set of
> >> values should be for a particular timestamp:
> >>
> >>   FlushSessionOrDie(session);
> >> }
> >> SaveSnapshot(std::move(snapshot), clock_->Now());
> >>
> >> So it waits until the data is flushed to the server and then captures
> the
> >> current server timestamp as a key for its record of expected data
> values at
> >> that timestamp. The above appears in all of the write code paths in the
> >> test. If FlushSessionOrDie() is an async no-op, and the client has not
> >> truly flushed outstanding data at the current timestamp, then the
> >> "SaveSnapshot()" code path will think the latest writes are included in
> the
> >> snapshot for the current time, but in actuality the old value will be
> >> stored for that timestamp on the server.
> >>
> >> This matches with the observed behavior based on running the (patched)
> >> test with --vmodule=tablet_history_gc-itest=2 and seeing the following
> >> in the log:
> >>
> >> I1122 15:42:48.355686 15303 tablet_history_gc-itest.cc:166] Saving snapshot at ts = P: 10 usec, L: 1 (40961)
> >> I1122 15:42:48.355768 15303 tablet_history_gc-itest.cc:380] Starting round 1
> >> ...
> >> I1122 15:42:48.355917 15303 tablet_history_gc-itest.cc:463] Updating row to { 511, 1535296252, 1535296251, NOT_DELETED }
> >> I1122 15:42:48.355926 15303 tablet_history_gc-itest.cc:463] Updating row to { 511, 1708088659, 1708088658, NOT_DELETED }
> >> ...
> >> I1122 15:42:48.975769 15303 tablet_history_gc-itest.cc:166] Saving snapshot at ts = P: 10 usec, L: 514 (409600514)
> >> I1122 15:42:48.975796 15303 tablet_history_gc-itest.cc:496] Updated 512 rows
> >> I1122 15:42:48.975807 15303 tablet_history_gc-itest.cc:380] Starting round 2
> >> ...
> >> I1122 15:42:54.547406 15303 tablet_history_gc-itest.cc:213] Round 26: Verifying snapshot scan for timestamp P: 10 usec, L: 1559 (409601559)
> >> ../../src/kudu/integration-tests/tablet_history_gc-itest.cc:242: Failure
> >> Value of: int_val
> >>   Actual: 1535296252
> >> Expected: 

Re: Flaky tablet_history_gc-itest

2016-10-12 Thread David Alves
No, you're right. I was looking at another test in that file.
Hum, then the opportunity for strange things to happen is much lower.
Will look more closely.

-david

On Wed, Oct 12, 2016 at 1:59 PM, Alexey Serbin  wrote:

> Hi David,
>
> Thank you for taking a look at that.
> I think the test already uses just one tablet server, so no replicas would
> be possible.  I see the following code in the test:
>
>   StartCluster(1); // Start MiniCluster with a single tablet server.
>
>   TestWorkload workload(cluster_.get());
>
>   workload.set_num_replicas(1);
>
>   workload.Setup(); // Convenient way to create a table.
>
>
>
> Did I miss something?  I.e., is there another control knob I should toggle
> somewhere?
>
>
> Thanks,
>
> Alexey
>
> On Wed, Oct 12, 2016 at 1:43 PM, David Alves 
> wrote:
>
> > Hi Alexey
> >
> > Thanks for going down the rabbit hole.
> > Could you try your patch without tablet replication for that test? If the
> > problem persists it's unlikely that it's related to the current
> consistency
> > gaps we have.
> > I'm a bit suspicious in that it seems to be doing snapshot scans without
> > retrying, and retrying is what we're doing pretty much everywhere else to
> > work around our gaps.
> >
> > -david
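
For reference, here is a rough sketch of the "retry the snapshot scan" pattern
David mentions, written against the public C++ scanner API. The table name and
retry policy are hypothetical, and the actual tests use their own helpers, so
treat this only as an illustration:

#include <cstdint>
#include <kudu/client/client.h>
#include <kudu/client/scan_batch.h>

using kudu::client::KuduClient;
using kudu::client::KuduScanBatch;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

kudu::Status ScanAtSnapshotWithRetry(const shared_ptr<KuduClient>& client,
                                     uint64_t snapshot_micros,
                                     int max_attempts) {
  shared_ptr<KuduTable> table;
  kudu::Status s = client->OpenTable("example_table", &table);  // hypothetical
  if (!s.ok()) return s;

  kudu::Status last_error = kudu::Status::TimedOut("snapshot scan retries exhausted");
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    KuduScanner scanner(table.get());
    last_error = scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT);
    if (!last_error.ok()) return last_error;
    last_error = scanner.SetSnapshotMicros(snapshot_micros);
    if (!last_error.ok()) return last_error;

    last_error = scanner.Open();
    if (!last_error.ok()) continue;  // retry transient errors opening the scanner

    KuduScanBatch batch;
    while (scanner.HasMoreRows()) {
      last_error = scanner.NextBatch(&batch);
      if (!last_error.ok()) break;   // retry the whole scan on failure
      // ... verify the rows in 'batch' against the expected snapshot here ...
    }
    if (last_error.ok()) return last_error;  // scan completed successfully
  }
  return last_error;
}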
> >
> > On Wed, Oct 12, 2016 at 1:06 PM, Alexey Serbin 
> > wrote:
> >
> > > One small update: the issue might not be in the GC logic, but in some
> > > other flakiness related to reading data at a snapshot.
> > >
> > > I updated the patch so the only operations the test now does are
> inserts,
> > > updates and scans. No tablet merge compactions, redo delta compactions,
> > > forced re-updates of missing deltas, or moving time forward.  The
> updated
> > > patch can be found at:
> > >   https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660
> > >
> > > The test reliably fails when run as described in the previous message in
> > > this thread; just use the updated patch location.
> > >
> > > David, maybe you can take a quick look at that as well?
> > >
> > >
> > > Thanks,
> > >
> > > Alexey
> > >
> > > On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin 
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > I played with the test (mostly in background), making the failure
> > almost
> > > > 100% reproducible.
> > > >
> > > > After collecting some evidence, I can say it's a server-side bug.  I
> > > think
> > > > so because the reproduction scenario I'm talking about uses good old
> > > > MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've
> modified
> > > the
> > > > test slightly to achieve a higher reproduction ratio and to settle the
> > > > question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
> > > >
> > > > That's what I found:
> > > >   1. The problem occurs when updating rows with the same primary keys
> > > > multiple times.
> > > >   2. It's crucial to flush (i.e. call KuduSession::Flush() or
> > > > KuduSession::FlushAsync()) freshly applied update operations not just
> > > once
> > > at the very end of a client session, but multiple times while adding
> > > those
> > > operations.  If flushing just once at the very end, the issue becomes
> > 0%
> > > > reproducible.
> > > >   3. The more updates for different rows we have, the more likely we
> > hit
> > > > the issue (but there should be at least a couple updates for every
> > row).
> > > >   4. The problem persists in all types of Kudu builds: debug, TSAN,
> > > > release, ASAN (in the decreasing order of the reproduction ratio).
> > > >   5. The problem is also highly reproducible if running the test via
> > the
> > > > dist_test.py utility (check for 256 out of 256 failure ratio at
> > > > http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
> > > >
> > > > To build the modified test and run the reproduction scenario:
> > > >   1. Get the patch from https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
> > > >   2. Apply the patch to the latest Kudu source from the master
> branch.
> > > >   3. Build debug, TSAN, release or ASAN configuration and run with
> the
> > > > command (the random seed is not really crucial, but this gives better
> > > > results):
> > > > ../../build-support/run-test.sh ./bin/tablet_history_gc-itest
> > > > --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> > > > --stress_cpu_threads=64 --test_random_seed=1213726993
> > > >
> > > > 4. If running via dist_test.py, run the following instead:
> > > >
> > > > ../../build-support/dist_test.py loop -n 256 --
> > > > ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> > > > --stress_cpu_threads=8 --test_random_seed=1213726993
> > > >
> > > > Mike, it seems I'll need your help to troubleshoot/debug this issue
> > > > further.
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Alexey
> > > >
> > > >
> > > > On Mon, Oct 3, 2016 

Re: Flaky tablet_history_gc-itest

2016-10-12 Thread Alexey Serbin
One small update: the issue might not be in the GC logic, but in some other
flakiness related to reading data at a snapshot.

I updated the patch so the only operations the test now does are inserts,
updates and scans. No tablet merge compactions, redo delta compactions,
forced re-updates of missing deltas, or moving time forward.  The updated
patch can be found at:
  https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660

The test reliably fails when run as described in the previous message in
this thread; just use the updated patch location.

David, maybe you can take a quick look at that as well?


Thanks,

Alexey

On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin  wrote:

> Hi,
>
> I played with the test (mostly in background), making the failure almost
> 100% reproducible.
>
> After collecting some evidence, I can say it's a server-side bug.  I think
> so because the reproduction scenario I'm talking about uses good old
> MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've modified the
> test slightly to achieve a higher reproduction ratio and to settle the
> question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
>
> That's what I found:
>   1. The problem occurs when updating rows with the same primary keys
> multiple times.
>   2. It's crucial to flush (i.e. call KuduSession::Flush() or
> KuduSession::FlushAsync()) freshly applied update operations not just once
> at the very end of a client session, but multiple times while adding those
> operations.  If flushing just once at the very end, the issue becomes 0%
> reproducible.
>   3. The more updates for different rows we have, the more likely we hit
> the issue (but there should be at least a couple updates for every row).
>   4. The problem persists in all types of Kudu builds: debug, TSAN,
> release, ASAN (in the decreasing order of the reproduction ratio).
>   5. The problem is also highly reproducible if running the test via the
> dist_test.py utility (check for 256 out of 256 failure ratio at
> http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
>
> To build the modified test and run the reproduction scenario:
>   1. Get the patch from https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
>   2. Apply the patch to the latest Kudu source from the master branch.
>   3. Build debug, TSAN, release or ASAN configuration and run with the
> command (the random seed is not really crucial, but this gives better
> results):
> ../../build-support/run-test.sh ./bin/tablet_history_gc-itest
> --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> --stress_cpu_threads=64 --test_random_seed=1213726993
>
> 4. If running via dist_test.py, run the following instead:
>
> ../../build-support/dist_test.py loop -n 256 --
> ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> --stress_cpu_threads=8 --test_random_seed=1213726993
>
> Mike, it seems I'll need your help to troubleshoot/debug this issue
> further.
>
>
> Best regards,
>
> Alexey
>
>
> On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin 
> wrote:
>
>> Todd,
>>
>> I apologize for the late response -- somehow my inbox is messed up.
>> Probably I need to switch to using a stand-alone mail application (such as
>> iMail) instead of a browser-based one.
>>
>> Yes, I'll take a look at that.
>>
>>
>> Best regards,
>>
>> Alexey
>>
>> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon  wrote:
>>
>>> This test has gotten flaky with a concerning failure mode (seeing
>>> "wrong" results, not just a timeout or something):
>>>
>>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest
>>>
>>> It seems like it got flaky starting with Alexey's
>>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to
>>> use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and
>>> not anything to do with GC.
>>>
>>> Alexey, do you have time to take a look, and perhaps consult with Mike
>>> if you think it's actually a server-side bug?
>>>
>>> -Todd
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>


Re: Flaky tablet_history_gc-itest

2016-10-12 Thread Alexey Serbin
Hi,

I played with the test (mostly in background), making the failure almost
100% reproducible.

After collecting some evidence, I can say it's a server-side bug.  I think
so because the reproduction scenario I'm talking about uses good old
MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've modified the
test slightly to achieve a higher reproduction ratio and to settle the
question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.

That's what I found:
  1. The problem occurs when updating rows with the same primary keys
multiple times.
  2. It's crucial to flush (i.e. call KuduSession::Flush() or
KuduSession::FlushAsync()) freshly applied update operations not just once
at the very end of a client session, but multiple times while adding those
operations.  If flushing just once at the very end, the issue becomes 0%
reproducible (see the sketch after this list).
  3. The more updates for different rows we have, the more likely we hit
the issue (but there should be at least a couple updates for every row).
  4. The problem persists in all types of Kudu builds: debug, TSAN,
release, ASAN (in the decreasing order of the reproduction ratio).
  5. The problem is also highly reproducible if running the test via the
dist_test.py utility (check for 256 out of 256 failure ratio at
http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
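
Here is a hedged sketch of the access pattern described in points 1-3 above,
with hypothetical table and column names (the real reproduction is the patch
linked below; this only illustrates the shape of the workload):

#include <cstdlib>
#include <kudu/client/client.h>

using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpdate;
using kudu::client::sp::shared_ptr;

static void CheckOk(const kudu::Status& s) {
  if (!s.ok()) std::abort();  // sketch only; real code should handle errors
}

void UpdateSameRowsWithPeriodicFlushes(const shared_ptr<KuduClient>& client) {
  shared_ptr<KuduTable> table;
  CheckOk(client->OpenTable("example_table", &table));

  shared_ptr<KuduSession> session = client->NewSession();
  CheckOk(session->SetFlushMode(KuduSession::MANUAL_FLUSH));

  for (int round = 0; round < 10; ++round) {       // several updates per key
    for (int key = 0; key < 100; ++key) {          // many distinct keys
      KuduUpdate* update = table->NewUpdate();
      CheckOk(update->mutable_row()->SetInt32("key", key));
      CheckOk(update->mutable_row()->SetInt32("val", round));
      CheckOk(session->Apply(update));
    }
    // Flush while updates are still being applied, not just once at the end;
    // per point 2 above, this is what makes the problem reproducible.
    CheckOk(session->Flush());
  }
}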

To build the modified test and run the reproduction scenario:
  1. Get the patch from
https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
  2. Apply the patch to the latest Kudu source from the master branch.
  3. Build debug, TSAN, release or ASAN configuration and run with the
command (the random seed is not really crucial, but this gives better
results):
../../build-support/run-test.sh ./bin/tablet_history_gc-itest
--gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
--stress_cpu_threads=64 --test_random_seed=1213726993

4. If running via dist_test.py, run the following instead:

../../build-support/dist_test.py loop -n 256 --
./bin/tablet_history_gc-itest
--gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
--stress_cpu_threads=8 --test_random_seed=1213726993

Mike, it seems I'll need your help to troubleshoot/debug this issue further.


Best regards,

Alexey


On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin  wrote:

> Todd,
>
> I apologize for the late response -- somehow my inbox is messed up.
> Probably I need to switch to using a stand-alone mail application (such as
> iMail) instead of a browser-based one.
>
> Yes, I'll take a look at that.
>
>
> Best regards,
>
> Alexey
>
> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon  wrote:
>
>> This test has gotten flaky with a concerning failure mode (seeing "wrong"
>> results, not just a timeout or something):
>>
>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest
>>
>> It seems like it got flaky starting with Alexey's
>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to use
>> AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not
>> anything to do with GC.
>>
>> Alexey, do you have time to take a look, and perhaps consult with Mike if
>> you think it's actually a server-side bug?
>>
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>


Flaky tablet_history_gc-itest

2016-09-26 Thread Todd Lipcon
This test has gotten flaky with a concerning failure mode (seeing "wrong"
results, not just a timeout or something):

http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest

It seems like it got flaky starting with Alexey's
commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to use
AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not
anything to do with GC.

Alexey, do you have time to take a look, and perhaps consult with Mike if
you think it's actually a server-side bug?

-Todd

-- 
Todd Lipcon
Software Engineer, Cloudera