One small update: the issue might not be in the GC logic, but rather some
other flakiness related to reading data at a snapshot.

I updated the patch so the only operations the test now does are inserts,
updates and scans. No tablet merge compactions, redo delta compactions,
forced re-updates of missing deltas, or moving time forward.  The updated
patch can be found at:

The test fails reliably when run as described in the previous message in
this thread; just use the updated patch location.

David, maybe you could take a quick look at that as well?
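
To frame what "reading data at a snapshot" is expected to guarantee, here is a
toy in-memory sketch of the property: a read at logical time T should return
the newest value written at or before T, no matter how many times the same key
was updated and flushed afterwards. This is not Kudu code and not what the
test does internally; VersionedTable, Write, ReadAt and
SnapshotReadsAreConsistent are names made up for illustration, and "flush" is
modeled simply as advancing a logical clock.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Toy model (NOT Kudu code): each key keeps its full history of values,
// ordered by the logical timestamp at which the write became visible.
class VersionedTable {
 public:
  // Record that 'key' took 'value' at logical time 'ts'.
  void Write(int32_t key, uint64_t ts, int32_t value) {
    history_[key][ts] = value;
  }

  // Snapshot read: stores in '*value' the newest value written at or before
  // 'snapshot_ts'. Returns false if the row did not exist at that time.
  bool ReadAt(int32_t key, uint64_t snapshot_ts, int32_t* value) const {
    auto row = history_.find(key);
    if (row == history_.end()) return false;
    // First history entry strictly newer than the snapshot.
    auto it = row->second.upper_bound(snapshot_ts);
    if (it == row->second.begin()) return false;
    *value = std::prev(it)->second;
    return true;
  }

 private:
  std::map<int32_t, std::map<uint64_t, int32_t>> history_;
};

// Mimics the shape of the reproduction scenario: round 0 inserts every row,
// each later round updates every row again, and every round is "flushed" at
// its own logical timestamp. Returns true if a snapshot read at each round's
// timestamp sees exactly the values written in that round.
bool SnapshotReadsAreConsistent(int num_rows, int num_rounds) {
  assert(num_rows > 0 && num_rounds > 0);
  VersionedTable table;
  for (int round = 0; round < num_rounds; ++round) {
    const uint64_t ts = static_cast<uint64_t>(round) + 1;
    for (int32_t key = 0; key < num_rows; ++key) {
      table.Write(key, ts, key * 1000 + round);
    }
  }
  for (int round = 0; round < num_rounds; ++round) {
    const uint64_t ts = static_cast<uint64_t>(round) + 1;
    for (int32_t key = 0; key < num_rows; ++key) {
      int32_t value = 0;
      if (!table.ReadAt(key, ts, &value)) return false;
      if (value != key * 1000 + round) return false;
    }
  }
  return true;
}
```

The failure mode in this thread would correspond to a snapshot read
returning something other than what this kind of model predicts.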



On Wed, Oct 12, 2016 at 2:01 AM, Alexey Serbin <> wrote:

> Hi,
> I played with the test (mostly in the background), making the failure almost
> 100% reproducible.
> After collecting some evidence, I can say it's a server-side bug.  I think
> so because the reproduction scenario I'm talking about uses good old
> MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've modified the
> test slightly to achieve a higher reproduction ratio and to settle the
> question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.
> Here's what I found:
>   1. The problem occurs when updating rows with the same primary keys
> multiple times.
>   2. It's crucial to flush (i.e. call KuduSession::Flush() or
> KuduSession::FlushAsync()) freshly applied update operations not just once
> at the very end of a client session, but multiple times while adding those
> operations.  When flushing just once at the very end, the issue does not
> reproduce at all.
>   3. The more updates for different rows we have, the more likely we are to
> hit the issue (but there should be at least a couple of updates for every
> row).
>   4. The problem persists in all types of Kudu builds: debug, TSAN,
> release, ASAN (in decreasing order of the reproduction ratio).
>   5. The problem is also highly reproducible if running the test via the
> utility (check for 256 out of 256 failure ratio at
> )
> To build the modified test and run the reproduction scenario:
>   1. Get the patch from
> 7c885148dadff8705912f6cc513108d0
>   2. Apply the patch to the latest Kudu source from the master branch.
>   3. Build the debug, TSAN, release or ASAN configuration and run the test
> with the following command (the random seed is not really crucial, but this
> one gives better results):
>     ../../build-support/ ./bin/tablet_history_gc-itest
> --gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> --stress_cpu_threads=64 --test_random_seed=1213726993
>   4. If running via, run the following instead:
>     ../../build-support/ loop -n 256 --
> ./bin/tablet_history_gc-itest --gtest_filter=
> RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
> --stress_cpu_threads=8 --test_random_seed=1213726993
> Mike, it seems I'll need your help to troubleshoot/debug this issue
> further.
> Best regards,
> Alexey
> On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <>
> wrote:
>> Todd,
>> I apologize for the late response -- somehow my inbox is messed up.
>> I probably need to switch to a stand-alone mail application (such as iMail)
>> instead of a browser-based one.
>> Yes, I'll take a look at that.
>> Best regards,
>> Alexey
>> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <> wrote:
>>> This test has gotten flaky with a concerning failure mode (seeing
>>> "wrong" results, not just a timeout or something):
>>> tablet_history_gc-itest
>>> It seems like it got flaky starting with Alexey's
>>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to
>>> use AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and
>>> not anything to do with GC.
>>> Alexey, do you have time to take a look, and perhaps consult with Mike
>>> if you think it's actually a server-side bug?
>>> -Todd
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
