Hello Mike Percy, Todd Lipcon,
I'd like you to do a code review. Please visit
to review the following change.
Change subject: catalog_manager: prevent spurious dirty callbacks from crashing
catalog_manager: prevent spurious dirty callbacks from crashing the process
The recent change to switch from LocalConsensus to RaftConsensus has led to
an increased number of dirty callback calls when using just one master. Some
of these calls occur with no term change; the catalog manager treats them as
any other calls and reloads the on-disk metadata.
In theory that's just unnecessary work, but CheckIsLeaderAndReady() doesn't
provide adequate protection for the rest of the master when in this state, so
nearly every RPC is allowed in during this time. That's an absolute disaster
for correctness: imagine a GetTableLocations() returning only a subset of a
table's tablets because the rest were still being loaded from disk.
Luckily for us it can also manifest as a crash  so we noticed it quickly.
I chose to fix it by ignoring these calls within
CatalogManager::VisitTablesAndTabletsTask and not
SysCatalogTable::SysCatalogStateChanged because the synchronization in the
former is more straight forward thanks to the size of worker_pool_. The new
test led to a crash 100% of the time without this fix.
There's an argument to be made for changing TableInfo::TabletMap's raw
pointers to shared pointers thus avoiding this crash altogether, but it's
still an incorrect state to be in, so I don't see the value in doing that.
While I was here, I snuck a few other changes in:
- Remove a lock acquisition from ElectedAsLeaderCb; it did nothing.
- Remove old_role_ from SysCatalogTable; it also did nothing.
- Narrow the lock acquisition in VisitTablesAndTabletsTask; it only needs
to protect the visiting logic.
1. Sample crash output:
F0618 05:47:41.795367 9330 ref_counted.cc:74] Check failed: !in_dtor_
*** Check failure stack trace: ***
@ 0x7f99fab05f7d google::LogMessage::Fail() at ??:0
@ 0x7f99fab07e7d google::LogMessage::SendToLog() at ??:0
@ 0x7f99fab05ab9 google::LogMessage::Flush() at ??:0
@ 0x7f99fab0891f google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x7f99fae8a637 kudu::subtle::RefCountedThreadSafeBase::AddRef() at
@ 0x7f9a07c486d8 make_scoped_refptr<>() at ??:0
@ 0x7f9a07c2f7f7 kudu::master::TableInfo::GetTabletsInRange() at ??:0
@ 0x7f9a07c2ec35 kudu::master::CatalogManager::GetTableLocations() at
@ 0x51962e kudu::master::WaitForRunningTabletCount() at
/data1/jenkins-workspace/kudu @ 0x51ce59
kudu::CreateTableStressTest_RestartMasterDuringCreation_Test::TestBody() a @
@ 0x7f99fc7ea012 testing::Test::Run() at ??:0
@ 0x7f99fc7ea158 testing::TestInfo::Run() at ??:0
@ 0x7f99fc7ea235 testing::TestCase::Run() at ??:0
@ 0x7f99fc7ea518 testing::internal::UnitTestImpl::RunAllTests() at ??:0
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
@ 0x7f99fc7ea7fd testing::UnitTest::Run() at ??:0
@ 0x7f9a08520cf6 main at ??:0
@ 0x7f99f97a9d5d __libc_start_main at ??:0
@ 0x43e06d (unknown) at ??:0
5 files changed, 71 insertions(+), 38 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/65/3465/1
To view, visit http://gerrit.cloudera.org:8080/3465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>