Your help would be most welcome, Mike. I haven't had a chance to look at it yet, but I share Todd's concern that this is worrying if it's a single node.

-david
On Wed, Oct 12, 2016 at 5:39 PM, Mike Percy <[email protected]> wrote:

Please let me know if you want me to dig into this; I likely won't have any large chunks of time to spend on this before Monday, though.

Mike

--
Mike Percy
Software Engineer, Cloudera

On Wed, Oct 12, 2016 at 11:23 PM, David Alves <[email protected]> wrote:

No, you're right. I was looking at another test in that file. Hmm, then the opportunity for strange things to happen is much lower. Will look more closely.

-david

On Wed, Oct 12, 2016 at 1:59 PM, Alexey Serbin <[email protected]> wrote:

Hi David,

Thank you for taking a look at that. I think the test already uses just one tablet server, so no replicas would be possible. I see the following code in the test:

StartCluster(1); // Start MiniCluster with a single tablet server.

TestWorkload workload(cluster_.get());
workload.set_num_replicas(1);
workload.Setup(); // Convenient way to create a table.

Did I miss something? I.e., should I toggle another control knob somewhere?

Thanks,
Alexey

On Wed, Oct 12, 2016 at 1:43 PM, David Alves <[email protected]> wrote:

Hi Alexey,

Thanks for going down the rabbit hole. Could you try your patch without tablet replication for that test? If the problem persists, it's unlikely to be related to the current consistency gaps we have. I'm a bit suspicious in that the test seems to be doing snapshot scans without retrying, which is what we're doing pretty much everywhere else to work around our gaps.

-david

On Wed, Oct 12, 2016 at 1:06 PM, Alexey Serbin <[email protected]> wrote:

One small update: the issue might not be in the GC logic, but some other flakiness related to reading data at a snapshot.

I updated the patch so the only operations the test now does are inserts, updates, and scans. No tablet merge compactions, redo delta compactions, forced re-updates of missing deltas, or moving time forward. The updated patch can be found at:
https://gist.github.com/alexeyserbin/06ed8dbdb0e8e9abcbde2991c6615660

The test firmly fails if run as described in the previous message in this thread; just use the updated patch location.

David, maybe you can take a quick look at that as well?

Thanks,
Alexey
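For reference, a minimal sketch of the retried snapshot-scan pattern mentioned above, assuming the public Kudu C++ client API; the retry policy and the CountRowsAtSnapshot helper are illustrative, not the test's actual code:

#include <cstdint>

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduScanBatch;
using kudu::client::KuduScanner;
using kudu::client::KuduTable;
using kudu::client::sp::shared_ptr;

// Run a READ_AT_SNAPSHOT scan and count rows, retrying the whole scan a few
// times on error instead of failing on the first attempt (illustrative
// retry policy only).
Status CountRowsAtSnapshot(const shared_ptr<KuduTable>& table,
                           int max_attempts,
                           int64_t* row_count) {
  Status last_status = Status::OK();
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    KuduScanner scanner(table.get());
    last_status = scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT);
    if (!last_status.ok()) continue;
    last_status = scanner.Open();
    if (!last_status.ok()) continue;

    int64_t count = 0;
    KuduScanBatch batch;
    while (scanner.HasMoreRows()) {
      last_status = scanner.NextBatch(&batch);
      if (!last_status.ok()) break;
      count += batch.NumRows();
    }
    if (last_status.ok()) {
      *row_count = count;
      return Status::OK();
    }
  }
  return last_status;
}

In practice a test may instead call KuduScanner::SetFaultTolerant(), which implies READ_AT_SNAPSHOT and lets the scanner resume after server-side errors.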
On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <[email protected]> wrote:

Hi,

I played with the test (mostly in the background), making the failure almost 100% reproducible.

After collecting some evidence, I can say it's a server-side bug. I think so because the reproduction scenario I'm talking about uses the good old MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode. Yes, I've modified the test slightly to achieve a higher reproduction ratio and to settle the question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.

That's what I found:
1. The problem occurs when updating rows with the same primary keys multiple times.
2. It's crucial to flush (i.e., call KuduSession::Flush() or KuduSession::FlushAsync()) freshly applied update operations not just once at the very end of a client session, but multiple times while adding those operations. If flushing just once at the very end, the issue becomes 0% reproducible.
3. The more updates for different rows we have, the more likely we hit the issue (but there should be at least a couple of updates for every row).
4. The problem persists in all types of Kudu builds: debug, TSAN, release, ASAN (in decreasing order of reproduction ratio).
5. The problem is also highly reproducible when running the test via the dist_test.py utility (check the 256-out-of-256 failure ratio at http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603).

To build the modified test and run the reproduction scenario:
1. Get the patch from https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
2. Apply the patch to the latest Kudu source from the master branch.
3. Build the debug, TSAN, release, or ASAN configuration and run with the following command (the random seed is not really crucial, but this one gives better results):

../../build-support/run-test.sh ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload --stress_cpu_threads=64 --test_random_seed=1213726993

4. If running via dist_test.py, run the following instead:

../../build-support/dist_test.py loop -n 256 -- ./bin/tablet_history_gc-itest --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload --stress_cpu_threads=8 --test_random_seed=1213726993

Mike, it seems I'll need your help to troubleshoot/debug this issue further.

Best regards,
Alexey
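A minimal sketch of the update-and-flush pattern findings 1-3 describe, assuming the public Kudu C++ client API; the table name, column names, and the helper itself are illustrative rather than the modified test's actual code:

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpdate;
using kudu::client::sp::shared_ptr;

// Update every row several times, flushing the session after each pass over
// the rows rather than only once at the end -- per finding #2, the periodic
// flushes are what make the wrong snapshot-scan results reproducible.
Status UpdateRowsWithPeriodicFlush(const shared_ptr<KuduClient>& client,
                                   int num_rows,
                                   int updates_per_row) {
  shared_ptr<KuduTable> table;
  Status s = client->OpenTable("test_table", &table);
  if (!s.ok()) return s;

  shared_ptr<KuduSession> session = client->NewSession();
  s = session->SetFlushMode(KuduSession::MANUAL_FLUSH);
  if (!s.ok()) return s;

  for (int pass = 0; pass < updates_per_row; ++pass) {
    for (int key = 0; key < num_rows; ++key) {
      KuduUpdate* update = table->NewUpdate();  // Apply() takes ownership.
      // SetInt32() statuses ignored for brevity.
      update->mutable_row()->SetInt32("key", key);
      update->mutable_row()->SetInt32("int_val", pass);
      s = session->Apply(update);
      if (!s.ok()) return s;
    }
    s = session->Flush();  // Flush mid-workload, not just at the very end.
    if (!s.ok()) return s;
  }
  return Status::OK();
}

KuduSession::FlushAsync() with a callback would exercise the same path; the key point is that the session is flushed repeatedly during the workload rather than once at the end.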
On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <[email protected]> wrote:

Todd,

I apologize for the late response -- somehow my inbox is messed up. Probably I need to switch to a stand-alone mail application (such as iMail) instead of a browser-based one.

Yes, I'll take a look at that.

Best regards,
Alexey

On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <[email protected]> wrote:

This test has gotten flaky with a concerning failure mode (seeing "wrong" results, not just a timeout or something):

http://dist-test.cloudera.org:8080/test_drilldown?test_name=tablet_history_gc-itest

It seems like it got flaky starting with Alexey's commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa, which switched it to use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not anything to do with GC.

Alexey, do you have time to take a look, and perhaps consult with Mike if you think it's actually a server-side bug?

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
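For context, this is roughly what the switch to background flushing looks like on the client side (a sketch assuming the public Kudu C++ client API; the helper name is illustrative):

#include "kudu/client/client.h"

using kudu::Status;
using kudu::client::KuduClient;
using kudu::client::KuduSession;
using kudu::client::sp::shared_ptr;

// With AUTO_FLUSH_BACKGROUND the session buffers applied operations and
// flushes them from a background thread, so the test no longer controls
// exactly when (or in how many batches) the updates reach the tablet server.
Status NewBackgroundFlushSession(const shared_ptr<KuduClient>& client,
                                 shared_ptr<KuduSession>* session) {
  *session = client->NewSession();
  return (*session)->SetFlushMode(KuduSession::AUTO_FLUSH_BACKGROUND);
}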
