[ https://issues.apache.org/jira/browse/KUDU-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexey Serbin updated KUDU-2923:
--------------------------------
Description:
In rare cases, the {{RaftConsensusITest.MultiThreadedInsertWithFailovers}}
test scenario crashes with the following output:
{noformat}
I0820 18:21:28.614696 1042 raft_consensus.cc:2890] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 13 FOLLOWER]: Advancing to term 14
I0820 18:21:28.615350 1042 raft_consensus.cc:1184] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 14 FOLLOWER]: Refusing update from remote peer 83184eab2eae4146956292754c8fe346: Log matching property violated. Preceding OpId in replica: term: 13 index: 3478. Preceding OpId from leader: term: 14 index: 3480. (index mismatch)
F0820 18:21:28.637754 225 raft_consensus-itest.cc:394] Check failed: _s.ok() Bad status: Not found: leader replica not found
*** Check failure stack trace: ***
*** Aborted at 1566325288 (unix time) try "date -d @1566325288" if you are using GNU date ***
PC: @ 0x7f76cf228c37 gsignal
*** SIGABRT (@0x3e8000000e1) received by PID 225 (TID 0x7f76d45f3000) from PID 225; stack trace: ***
@ 0x7f76d185a330 (unknown) at ??:0
@ 0x7f76cf228c37 gsignal at ??:0
@ 0x7f76cf22c028 abort at ??:0
@ 0x7f76d0297e09 google::logging_fail() at ??:0
@ 0x7f76d029962d google::LogMessage::Fail() at ??:0
@ 0x7f76d029b64c google::LogMessage::SendToLog() at ??:0
@ 0x7f76d0299189 google::LogMessage::Flush() at ??:0
@ 0x7f76d029bfdf google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x42d346 kudu::tserver::RaftConsensusITest::StopOrKillLeaderAndElectNewOne() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:399 (discriminator 1)
@ 0x43734b kudu::tserver::RaftConsensusITest_MultiThreadedInsertWithFailovers_Test::TestBody() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:1005
@ 0x7f76d0aeeb89 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
@ 0x7f76d0adf68f testing::Test::Run() at ??:0
@ 0x7f76d0adf74d testing::TestInfo::Run() at ??:0
@ 0x7f76d0adf865 testing::TestCase::Run() at ??:0
@ 0x7f76d0adfb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
@ 0x7f76d0adfdc9 testing::UnitTest::Run() at ??:0
@ 0x7f76d3d7e502 main at ??:0
@ 0x7f76cf213f45 __libc_start_main at ??:0
@ 0x42adb3 (unknown) at ??:?
{noformat}
The issue appears to be in the following code:
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/ts_itest-base.cc#L294-L313
If a snapshot of a tablet's Raft configuration is captured during a leader
election, it might contain no leader replica. In that case,
{{TabletServerIntegrationTestBase::GetTabletLeaderAndFollowers()}} returns
{{Status::NotFound()}} and the test aborts on the failed status check at
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/raft_consensus-itest.cc#L394
The full log of the test scenario's run is attached for reference.
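One possible way to make the test tolerate a leaderless snapshot is to retry
the leader lookup until a deadline instead of failing on the first
{{Status::NotFound()}}. The sketch below only illustrates that idea and is not
the actual fix: {{SketchStatus}} is a hypothetical stand-in for
{{kudu::Status}}, and {{RetryUntilLeaderFound()}} with its timeout and backoff
parameters is an assumed helper that does not exist in the Kudu code base.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for kudu::Status, used only to keep this sketch
// self-contained. It is NOT the real Kudu class.
struct SketchStatus {
  bool ok_flag;
  std::string message;

  bool ok() const { return ok_flag; }
  static SketchStatus OK() { return {true, ""}; }
  static SketchStatus NotFound(std::string msg) {
    return {false, "Not found: " + std::move(msg)};
  }
};

// Hypothetical helper: calls 'find_leader' once, then keeps retrying with a
// fixed pause while it fails and the deadline has not passed. Returns the
// status of the last attempt, so a persistent failure is still reported.
SketchStatus RetryUntilLeaderFound(
    const std::function<SketchStatus()>& find_leader,
    std::chrono::milliseconds timeout,
    std::chrono::milliseconds backoff) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  SketchStatus s = find_leader();
  while (!s.ok() && std::chrono::steady_clock::now() < deadline) {
    // A leader election may still be in progress: wait and look again.
    std::this_thread::sleep_for(backoff);
    s = find_leader();
  }
  return s;
}
```

In the test, the callable passed as {{find_leader}} would wrap the existing
lookup, and the status check would be applied only to the value returned
after the retries. The timeout would presumably need to exceed the expected
leader election time.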
was:
In rare cases, the {{}} test scenario crashes with the following output:
{noformat}
I0820 18:21:28.614696 1042 raft_consensus.cc:2890] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 13 FOLLOWER]: Advancing to term 14
I0820 18:21:28.615350 1042 raft_consensus.cc:1184] T a0ea2bbe8bad446b8782b902ca670735 P b1747d43333448629e1fb9b7c7dff193 [term 14 FOLLOWER]: Refusing update from remote peer 83184eab2eae4146956292754c8fe346: Log matching property violated. Preceding OpId in replica: term: 13 index: 3478. Preceding OpId from leader: term: 14 index: 3480. (index mismatch)
F0820 18:21:28.637754 225 raft_consensus-itest.cc:394] Check failed: _s.ok() Bad status: Not found: leader replica not found
*** Check failure stack trace: ***
*** Aborted at 1566325288 (unix time) try "date -d @1566325288" if you are using GNU date ***
PC: @ 0x7f76cf228c37 gsignal
*** SIGABRT (@0x3e8000000e1) received by PID 225 (TID 0x7f76d45f3000) from PID 225; stack trace: ***
@ 0x7f76d185a330 (unknown) at ??:0
@ 0x7f76cf228c37 gsignal at ??:0
@ 0x7f76cf22c028 abort at ??:0
@ 0x7f76d0297e09 google::logging_fail() at ??:0
@ 0x7f76d029962d google::LogMessage::Fail() at ??:0
@ 0x7f76d029b64c google::LogMessage::SendToLog() at ??:0
@ 0x7f76d0299189 google::LogMessage::Flush() at ??:0
@ 0x7f76d029bfdf google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x42d346 kudu::tserver::RaftConsensusITest::StopOrKillLeaderAndElectNewOne() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:399 (discriminator 1)
@ 0x43734b kudu::tserver::RaftConsensusITest_MultiThreadedInsertWithFailovers_Test::TestBody() at /data0/jenkins/workspace/kudu-pre-commit-unittest-RELEASE/src/kudu/integration-tests/raft_consensus-itest.cc:1005
@ 0x7f76d0aeeb89 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
@ 0x7f76d0adf68f testing::Test::Run() at ??:0
@ 0x7f76d0adf74d testing::TestInfo::Run() at ??:0
@ 0x7f76d0adf865 testing::TestCase::Run() at ??:0
@ 0x7f76d0adfb28 testing::internal::UnitTestImpl::RunAllTests() at ??:0
@ 0x7f76d0adfdc9 testing::UnitTest::Run() at ??:0
@ 0x7f76d3d7e502 main at ??:0
@ 0x7f76cf213f45 __libc_start_main at ??:0
@ 0x42adb3 (unknown) at ??:?
{noformat}
The issue appears to be in the following code:
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/ts_itest-base.cc#L294-L313
If a snapshot of a tablet's Raft configuration is captured during a leader
election, it might contain no leader replica. In that case,
{{TabletServerIntegrationTestBase::GetTabletLeaderAndFollowers()}} returns
{{Status::NotFound()}} and the test aborts on the failed status check at
https://github.com/apache/kudu/blob/413396c85d7dd56830f563ec754653b2b0ae26fd/src/kudu/integration-tests/raft_consensus-itest.cc#L394
The full log of the test scenario's run is attached for reference.
> RaftConsensusITest.MultiThreadedInsertWithFailovers is flaky
> ------------------------------------------------------------
>
> Key: KUDU-2923
> URL: https://issues.apache.org/jira/browse/KUDU-2923
> Project: Kudu
> Issue Type: Bug
> Components: consensus, test
> Affects Versions: 1.11.0
> Reporter: Alexey Serbin
> Priority: Minor
> Attachments: raft_consensus-itest.txt.xz
>
>
--
This message was sent by Atlassian Jira
(v8.3.2#803003)