Alexey Serbin created KUDU-2923:
-----------------------------------

             Summary: RaftConsensusITest.MultiThreadedInsertWithFailovers is flaky
                 Key: KUDU-2923
                 URL: https://issues.apache.org/jira/browse/KUDU-2923
             Project: Kudu
          Issue Type: Bug
          Components: consensus, test
    Affects Versions: 1.11.0
            Reporter: Alexey Serbin
         Attachments: raft_consensus-itest.txt.xz

In rare cases, the {{RaftConsensusITest.MultiThreadedInsertWithFailovers}} test scenario crashes with the following output:

{noformat}
I0820 18:21:28.614696  1042 raft_consensus.cc:2890] T 
a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 13 
FOLLOWER]: Advancing to term 14
I0820 18:21:28.615350  1042 raft_consensus.cc:1184] T 
a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 14 
FOLLOWER]: Refusing update from remote peer 83184eab2eae4146956292754c8fe346: 
Log matching property violated. Preceding OpId in replica: term: 13 index: 
3478. Preceding OpId from leader: term: 14 index: 3480. (index mismatch)
F0820 18:21:28.637754   225 raft_consensus-itest.cc:394] Check failed: _s.ok() 
Bad status: Not found: leader replica not found
*** Check failure stack trace: ***
*** Aborted at 1566325288 (unix time) try "date -d @1566325288" if you are 
using GNU date ***
PC: @     0x7f76cf228c37 gsignal
*** SIGABRT (@0x3e8000000e1) received by PID 225 (TID 0x7f76d45f3000) from PID 
225; stack trace: ***
    @     0x7f76d185a330 (unknown) at ??:0
    @     0x7f76cf228c37 gsignal at ??:0
    @     0x7f76cf22c028 abort at ??:0
    @     0x7f76d0297e09 google::logging_fail() at ??:0
    @     0x7f76d029962d google::LogMessage::Fail() at ??:0
    @     0x7f76d029b64c google::LogMessage::SendToLog() at ??:0
    @     0x7f76d0299189 google::LogMessage::Flush() at ??:0
    @     0x7f76d029bfdf google::LogMessageFatal::~LogMessageFatal() at ??:0
    @           0x42d346 
kudu::tserver::RaftConsensusITest::StopOrKillLeaderAndElectNewOne() at 
/data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:399
 (discriminator 1)
    @           0x43734b 
kudu::tserver::RaftConsensusITest_MultiThreadedInsertWithFailovers_Test::TestBody()
 at 
/data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:1005
    @     0x7f76d0aeeb89 
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f76d0adf68f testing::Test::Run() at ??:0
    @     0x7f76d0adf74d testing::TestInfo::Run() at ??:0
    @     0x7f76d0adf865 testing::TestCase::Run() at ??:0
    @     0x7f76d0adfb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7f76d0adfdc9 testing::UnitTest::Run() at ??:0
    @     0x7f76d3d7e502 main at ??:0
    @     0x7f76cf213f45 __libc_start_main at ??:0
    @           0x42adb3 (unknown) at ??:?

{noformat}

It seems the issue is in 
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/ts_itest-base.cc#L294-L313
 : if a snapshot of a tablet's Raft configuration is captured during a 
leader election, it might contain no leader replica.  In such a case, 
{{TabletServerIntegrationTestBase::GetTabletLeaderAndFollowers()}} returns 
{{Status::NotFound()}} and the test crashes at 
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/raft_consensus-itest.cc#L394
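
One possible way to tolerate a leaderless snapshot is to re-capture the configuration until a leader shows up or a deadline passes, instead of failing on the first {{Status::NotFound()}}. The C++17 sketch below only illustrates that idea. {{ReplicaSnapshot}}, {{FindLeader()}}, and {{WaitForLeader()}} are hypothetical names made up for this example; they are not Kudu APIs and this is not necessarily how the test should be fixed.

```cpp
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one replica in a captured Raft configuration.
// This is NOT a Kudu type; it exists only for this illustration.
struct ReplicaSnapshot {
  std::string uuid;
  bool is_leader;
};

// Returns the UUID of the leader replica in the snapshot, or std::nullopt
// if the snapshot contains no leader (e.g. it was taken mid-election).
std::optional<std::string> FindLeader(
    const std::vector<ReplicaSnapshot>& snapshot) {
  for (const auto& replica : snapshot) {
    if (replica.is_leader) return replica.uuid;
  }
  return std::nullopt;
}

// Re-captures the snapshot until it contains a leader or the timeout expires.
// 'take_snapshot' is an assumed callback that fetches a fresh snapshot.
std::optional<std::string> WaitForLeader(
    const std::function<std::vector<ReplicaSnapshot>()>& take_snapshot,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds retry_interval = std::chrono::milliseconds(10)) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (auto leader = FindLeader(take_snapshot())) return leader;
    if (std::chrono::steady_clock::now() >= deadline) return std::nullopt;
    std::this_thread::sleep_for(retry_interval);
  }
}
```

With a helper along these lines, the caller would treat a missing leader as a transient condition and fail only if no leader appears within the timeout.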

The full log of the test scenario's run is attached for reference.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
