Thanks for going down the rabbit hole.
Could you try your patch without tablet replication for that test? If the
problem persists, it's unlikely to be related to the current consistency
gaps we have.
I'm a bit suspicious because the test seems to be doing snapshot scans
without retrying, and retrying is what we do pretty much everywhere else to
work around those gaps.
On Wed, Oct 12, 2016 at 1:06 PM, Alexey Serbin <aser...@cloudera.com> wrote:
> One small update: the issue might be not in GC logic, but some other
> flakiness related to reading data at snapshot.
> I updated the patch so the only operations the test now does are inserts,
> updates and scans. No tablet merge compactions, redo delta compactions,
> forced re-updates of missing deltas, or moving time forward. The updated
> patch can be found at:
> The test fails consistently when run as described in the previous message
> in this thread; just use the updated patch location.
> David, maybe you can take a quick look at that as well?
> On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <aser...@cloudera.com> wrote:
> > Hi,
> > I played with the test (mostly in background), making the failure almost
> > 100% reproducible.
> > After collecting some evidence, I can say it's a server-side bug. I think
> > so because the reproduction scenario I'm talking about uses good old
> > MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode. Yes, I've modified the
> > test slightly to achieve a higher reproduction ratio and to settle the
> > question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
> > That's what I found:
> > 1. The problem occurs when updating rows with the same primary keys
> > multiple times.
> > 2. It's crucial to flush (i.e. call KuduSession::Flush() or
> > KuduSession::FlushAsync()) freshly applied update operations not just
> > once at the very end of a client session, but multiple times while adding
> > operations. When flushing only once at the very end, the issue is not
> > reproducible at all.
> > 3. The more updates for different rows we have, the more likely we hit
> > the issue (but there should be at least a couple updates for every row).
> > 4. The problem persists in all types of Kudu builds: debug, TSAN,
> > release, ASAN (in the decreasing order of the reproduction ratio).
> > 5. The problem is also highly reproducible if running the test via the
> > dist_test.py utility (check for 256 out of 256 failure ratio at
> > http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
> > To build the modified test and run the reproduction scenario:
> > 1. Get the patch from
> > https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
> > 2. Apply the patch to the latest Kudu source from the master branch.
> > 3. Build the debug, TSAN, release, or ASAN configuration and run the test
> > with the following command (the random seed is not really crucial, but
> > this particular one gives better results):
> > ../../build-support/run-test.sh ./bin/tablet_history_gc-itest
> > --gtest_filter=RandomizedTabletHistoryGcITest
> > --stress_cpu_threads=64 --test_random_seed=1213726993
> > 4. If running via dist_test.py, run the following instead:
> > ../../build-support/dist_test.py loop -n 256 --
> > ./bin/tablet_history_gc-itest
> > --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> > --stress_cpu_threads=8 --test_random_seed=1213726993
> > Mike, it seems I'll need your help to troubleshoot/debug this issue
> > further.
> > Best regards,
> > Alexey
> > On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <aser...@cloudera.com>
> > wrote:
> >> Todd,
> >> I apologize for the late response -- somehow my inbox is messed up.
> >> Probably I need to switch to a stand-alone mail application instead of
> >> the browser-based one.
> >> Yes, I'll take a look at that.
> >> Best regards,
> >> Alexey
> >> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
> >>> This test has gotten flaky with a concerning failure mode (seeing
> >>> "wrong" results, not just a timeout or something):
> >>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=
> >>> tablet_history_gc-itest
> >>> It seems like it got flaky starting with Alexey's
> >>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa, which switched it to
> >>> use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug,
> >>> not anything to do with GC.
> >>> Alexey, do you have time to take a look, and perhaps consult with Mike
> >>> if you think it's actually a server-side bug?
> >>> -Todd
> >>> --
> >>> Todd Lipcon
> >>> Software Engineer, Cloudera