Your help would be most welcome, Mike. I haven't had a chance to look at it yet, but I share Todd's concern that this is worrying if it's a single node.

-david
On Wed, Oct 12, 2016 at 5:39 PM, Mike Percy <[email protected]> wrote:

Please let me know if you want me to dig into this; I likely won't have any large chunks of time to spend on this before Monday, though.

Mike

--
Mike Percy
Software Engineer, Cloudera

On Wed, Oct 12, 2016 at 11:23 PM, David Alves <[email protected]> wrote:

No, you're right. I was looking at another test in that file. Hmm, then the opportunity for strange things to happen is much lower. Will look more closely.

-david

On Wed, Oct 12, 2016 at 1:59 PM, Alexey Serbin <[email protected]> wrote:

Hi David,

Thank you for taking a look at that. I think the test already uses just one tablet server, so no replicas would be possible. I see the following code in the test:

StartCluster(1); // Start MiniCluster with a single tablet server.

TestWorkload workload(cluster_.get());
workload.set_num_replicas(1);
workload.Setup(); // Convenient way to create a table.

Did I miss something? I.e., should I toggle another control knob somewhere?

Thanks,
Alexey

On Wed, Oct 12, 2016 at 1:43 PM, David Alves <[email protected]> wrote:

Hi Alexey,

Thanks for going down the rabbit hole. Could you try your patch without tablet replication for that test? If the problem persists, it's unlikely to be related to the current consistency gaps we have. I'm a bit suspicious in that the test seems to be doing snapshot scans without retrying, which is what we're doing pretty much everywhere else to work around our gaps.

-david

On Wed, Oct 12, 2016 at 1:06 PM, Alexey Serbin <[email protected]> wrote:

One small update: the issue might not be in the GC logic, but some other flakiness related to reading data at a snapshot.

I updated the patch so the only operations the test now does are inserts, updates, and scans. No tablet merge compactions, redo delta compactions, forced re-updates of missing deltas, or moving time forward. The updated patch can be found at:
https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660

The test firmly fails if run as described in the previous message in this thread; just use the updated patch location.

David, maybe you can take a quick look at that as well?

Thanks,
Alexey
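For reference, a minimal sketch of the retried snapshot-scan pattern mentioned above, assuming the public Kudu C++ client API; the retry policy and the CountRowsAtSnapshot helper are illustrative, not the test's actual code:

#include <cstdint>

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduScanBatch;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

// Run a READ_AT_SNAPSHOT scan and count rows, retrying the whole scan a few
// times on error instead of failing on the first attempt (illustrative
// retry policy only).
Status CountRowsAtSnapshot(const shared_ptr<KuduTable>& table,
                           int max_attempts,
                           int64_t* row_count) {
  Status last_status = Status::OK();
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    KuduScanner scanner(table.get());
    last_status = scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT);
    if (!last_status.ok()) continue;
    last_status = scanner.Open();
    if (!last_status.ok()) continue;

    int64_t count = 0;
    KuduScanBatch batch;
    while (scanner.HasMoreRows()) {
      last_status = scanner.NextBatch(&batch);
      if (!last_status.ok()) break;
      count += batch.NumRows();
    }
    if (last_status.ok()) {
      *row_count = count;
      return Status::OK();
    }
  }
  return last_status;
}

In practice a test may instead call KuduScanner::SetFaultTolerant(), which implies READ_AT_SNAPSHOT and lets the scanner resume after server-side errors.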
On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <[email protected]> wrote:

Hi,

I played with the test (mostly in the background), making the failure almost 100% reproducible.

After collecting some evidence, I can say it's a server-side bug. I think so because the reproduction scenario I'm talking about uses the good old MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode. Yes, I've modified the test slightly to achieve a higher reproduction ratio and to settle the question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.

That's what I found:
1. The problem occurs when updating rows with the same primary keys multiple times.
2. It's crucial to flush (i.e., call KuduSession::Flush() or KuduSession::FlushAsync()) freshly applied update operations not just once at the very end of a client session, but multiple times while adding those operations. If flushing just once at the very end, the issue becomes 0% reproducible.
3. The more updates for different rows we have, the more likely we hit the issue (but there should be at least a couple of updates for every row).
4. The problem persists in all types of Kudu builds: debug, TSAN, release, ASAN (in decreasing order of reproduction ratio).
5. The problem is also highly reproducible when running the test via the dist_test.py utility (check the 256-out-of-256 failure ratio at http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603).

To build the modified test and run the reproduction scenario:
1. Get the patch from https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
2. Apply the patch to the latest Kudu source from the master branch.
3. Build the debug, TSAN, release, or ASAN configuration and run with the following command (the random seed is not really crucial, but this one gives better results):

../../build-support/run-test.sh ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload --stress_cpu_threads=64 --test_random_seed=1213726993

4. If running via dist_test.py, run the following instead:

../../build-support/dist_test.py loop -n 256 -- ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload --stress_cpu_threads=8 --test_random_seed=1213726993

Mike, it seems I'll need your help to troubleshoot/debug this issue further.

Best regards,
Alexey
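A minimal sketch of the update-and-flush pattern findings 1-3 describe, assuming the public Kudu C++ client API; the table name, column names, and the helper itself are illustrative rather than the modified test's actual code:

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpdate;
using kudu::client::sp::shared_ptr;

// Update every row several times, flushing the session after each pass over
// the rows rather than only once at the end -- per finding #2, the periodic
// flushes are what make the wrong snapshot-scan results reproducible.
Status UpdateRowsWithPeriodicFlush(const shared_ptr<KuduClient>& client,
                                   int num_rows,
                                   int updates_per_row) {
  shared_ptr<KuduTable> table;
  Status s = client->OpenTable("test_table", &table);
  if (!s.ok()) return s;

  shared_ptr<KuduSession> session = client->NewSession();
  s = session->SetFlushMode(KuduSession::MANUAL_FLUSH);
  if (!s.ok()) return s;

  for (int pass = 0; pass < updates_per_row; ++pass) {
    for (int key = 0; key < num_rows; ++key) {
      KuduUpdate* update = table->NewUpdate();  // Apply() takes ownership.
      // SetInt32() statuses ignored for brevity.
      update->mutable_row()->SetInt32("key", key);
      update->mutable_row()->SetInt32("int_val", pass);
      s = session->Apply(update);
      if (!s.ok()) return s;
    }
    s = session->Flush();  // Flush mid-workload, not just at the very end.
    if (!s.ok()) return s;
  }
  return Status::OK();
}

KuduSession::FlushAsync() with a callback would exercise the same path; the key point is that the session is flushed repeatedly during the workload rather than once at the end.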
On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <[email protected]> wrote:

Todd,

I apologize for the late response -- somehow my inbox is messed up. Probably I need to switch to a stand-alone mail application (such as iMail) instead of a browser-based one.

Yes, I'll take a look at that.

Best regards,
Alexey

On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <[email protected]> wrote:

This test has gotten flaky with a concerning failure mode (seeing "wrong" results, not just a timeout or something):

http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest

It seems like it got flaky starting with Alexey's commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa, which switched it to use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not anything to do with GC.

Alexey, do you have time to take a look, and perhaps consult with Mike if you think it's actually a server-side bug?

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
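For context, this is roughly what the switch to background flushing looks like on the client side (a sketch assuming the public Kudu C++ client API; the helper name is illustrative):

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::sp::shared_ptr;

// With AUTO_FLUSH_BACKGROUND the session buffers applied operations and
// flushes them from a background thread, so the test no longer controls
// exactly when (or in how many batches) the updates reach the tablet server.
Status NewBackgroundFlushSession(const shared_ptr<KuduClient>& client,
                                 shared_ptr<KuduSession>* session) {
  *session = client->NewSession();
  return (*session)->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND);
}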
