Hi,

I played with the test (mostly in the background) and made the failure almost
100% reproducible.

After collecting some evidence, I can say it's a server-side bug.  I think
so because the reproduction scenario I'm talking about uses the good old
MANUAL_FLUSH mode, not AUTO_FLUSH_BACKGROUND mode.  Yes, I've modified the
test slightly to achieve a higher reproduction ratio and to settle the
question of whether it's an AUTO_FLUSH_BACKGROUND-specific bug.

Here's what I found:
  1. The problem occurs when updating rows with the same primary keys
multiple times.
  2. It's crucial to flush (i.e. call KuduSession::Flush() or
KuduSession::FlushAsync()) the freshly applied update operations not just
once at the very end of a client session, but multiple times while adding
those operations.  When flushing only once at the very end, the issue is
not reproducible at all.
  3. The more updates for different rows there are, the more likely the
issue is to occur (but there should be at least a couple of updates for
every row).
  4. The problem persists in all types of Kudu builds: debug, TSAN,
release, ASAN (in decreasing order of reproduction ratio).
  5. The problem is also highly reproducible when running the test via the
dist_test.py utility (see the 256 out of 256 failure ratio at
http://dist-test.cloudera.org//job?job_id=aserbin.1476258983.2603 )
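For reference, the write pattern behind findings 1 and 2 looks roughly like
the sketch below.  This is not the actual test code -- the master address,
table name, and column names ("key", "val") are made up for illustration --
and it needs the Kudu C++ client headers and a running cluster, so it won't
build or run standalone:

```cpp
#include <kudu/client/client.h>

using kudu::client::KuduClient;
using kudu::client::KuduClientBuilder;
using kudu::client::KuduSession;
using kudu::client::KuduTable;
using kudu::client::KuduUpdate;

int main() {
  kudu::client::sp::shared_ptr<KuduClient> client;
  // Hypothetical master address; substitute your own.
  KUDU_CHECK_OK(KuduClientBuilder()
                    .add_master_server_addr("master-host:7051")
                    .Build(&client));

  // Hypothetical pre-existing table with int32 columns "key" and "val".
  kudu::client::sp::shared_ptr<KuduTable> table;
  KUDU_CHECK_OK(client->OpenTable("test_table", &table));

  kudu::client::sp::shared_ptr<KuduSession> session = client->NewSession();
  // MANUAL_FLUSH: applied operations accumulate in the session buffer
  // until Flush()/FlushAsync() is called explicitly.
  KUDU_CHECK_OK(session->SetFlushMode(KuduSession::MANUAL_FLUSH));

  // Update the same set of primary keys several times, flushing after
  // every round rather than once at the end -- per findings 1 and 2,
  // this is the pattern that makes the failure reproducible.
  for (int round = 0; round < 10; ++round) {
    for (int key = 0; key < 100; ++key) {
      KuduUpdate* update = table->NewUpdate();
      KUDU_CHECK_OK(update->mutable_row()->SetInt32("key", key));
      KUDU_CHECK_OK(update->mutable_row()->SetInt32("val", round));
      KUDU_CHECK_OK(session->Apply(update));  // session takes ownership
    }
    KUDU_CHECK_OK(session->Flush());  // flush mid-session, not just at the end
  }
  return 0;
}
```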

To build the modified test and run the reproduction scenario:
  1. Get the patch from
https://gist.github.com/alexeyserbin/7c885148dadff8705912f6cc513108d0
  2. Apply the patch to the latest Kudu source from the master branch.
  3. Build the debug, TSAN, release, or ASAN configuration and run the test
with the following command (the random seed is not really crucial, but this
one gives better results):
    ../../build-support/run-test.sh ./bin/tablet_history_gc-itest
--gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
--stress_cpu_threads=64 --test_random_seed=1213726993

  4. If running via dist_test.py, run the following instead:

    ../../build-support/dist_test.py loop -n 256 --
./bin/tablet_history_gc-itest
--gtest_filter=RandomizedTabletHistoryGcITest.TestRandomHistoryGCWorkload
--stress_cpu_threads=8 --test_random_seed=1213726993

Mike, it seems I'll need your help to troubleshoot/debug this issue further.


Best regards,

Alexey


On Mon, Oct 3, 2016 at 9:48 AM, Alexey Serbin <aser...@cloudera.com> wrote:

> Todd,
>
> I apologize for the late response -- somehow my inbox got messed up.
> Probably, I need to switch to a stand-alone mail application (like iMail)
> instead of a browser-based one.
>
> Yes, I'll take a look at that.
>
>
> Best regards,
>
> Alexey
>
> On Mon, Sep 26, 2016 at 12:58 PM, Todd Lipcon <t...@cloudera.com> wrote:
>
>> This test has gotten flaky with a concerning failure mode (seeing "wrong"
>> results, not just a timeout or something):
>>
>> http://dist-test.cloudera.org:8080/test_drilldown?test_name=
>> tablet_history_gc-itest
>>
>> It seems like it got flaky starting with Alexey's
>> commit bc14b2f9d775c9f27f2e2be36d4b03080977e8fa which switched it to use
>> AUTO_FLUSH_BACKGROUND. So perhaps the bug is actually a client bug and not
>> anything to do with GC.
>>
>> Alexey, do you have time to take a look, and perhaps consult with Mike if
>> you think it's actually a server-side bug?
>>
>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>
