Adar Dembo has uploaded a new patch set (#2).

Change subject: catalog_manager: prevent spurious dirty callbacks from crashing 
the process

catalog_manager: prevent spurious dirty callbacks from crashing the process

The recent change to switch from LocalConsensus to RaftConsensus has led to
an increased number of dirty callback calls when using just one master. Some
of these calls occur with no term change; the catalog manager treats them as
any other calls and reloads the on-disk metadata.

In theory that's just unnecessary work, but CheckIsLeaderAndReady() doesn't
provide adequate protection for the rest of the master when in this state, so
nearly every RPC is allowed in during this time. That's an absolute disaster
for correctness: imagine a GetTableLocations() returning only a subset of a
table's tablets because the rest were still being loaded from disk.

Luckily for us it can also manifest as a crash [1] so we noticed it quickly.
I chose to fix it by ignoring these calls within
CatalogManager::VisitTablesAndTabletsTask and not
SysCatalogTable::SysCatalogStateChanged because the synchronization in the
former is more straight forward thanks to the size of worker_pool_. The new
test led to a crash 100% of the time without this fix.

There's an argument to be made for changing TableInfo::TabletMap's raw
pointers to shared pointers thus avoiding this crash altogether, but it's
still an incorrect state to be in, so I don't see the value in doing that.

While I was here, I snuck a few other changes in:
- Remove a lock acquisition from ElectedAsLeaderCb; it did nothing.
- Remove old_role_ from SysCatalogTable; it also did nothing.
- Narrow the lock acquisition in VisitTablesAndTabletsTask; it only needs
  to protect the visiting logic.

1. Sample crash output:

F0618 05:47:41.795367  9330] Check failed: !in_dtor_
*** Check failure stack trace: ***
    @     0x7f99fab05f7d  google::LogMessage::Fail() at ??:0
    @     0x7f99fab07e7d  google::LogMessage::SendToLog() at ??:0
    @     0x7f99fab05ab9  google::LogMessage::Flush() at ??:0
    @     0x7f99fab0891f  google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7f99fae8a637  kudu::subtle::RefCountedThreadSafeBase::AddRef() at 
    @     0x7f9a07c486d8  make_scoped_refptr<>() at ??:0
    @     0x7f9a07c2f7f7  kudu::master::TableInfo::GetTabletsInRange() at ??:0
    @     0x7f9a07c2ec35  kudu::master::CatalogManager::GetTableLocations() at 
    @           0x51962e  kudu::master::WaitForRunningTabletCount() at ...
    @           0x51ce59  
kudu::CreateTableStressTest_RestartMasterDuringCreation_Test::TestBody() at ...
    @     0x7f99fc7f5b48  
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f99fc7ea012  testing::Test::Run() at ??:0
    @     0x7f99fc7ea158  testing::TestInfo::Run() at ??:0
    @     0x7f99fc7ea235  testing::TestCase::Run() at ??:0
    @     0x7f99fc7ea518  testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7f99fc7f6058  
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f99fc7ea7fd  testing::UnitTest::Run() at ??:0
    @     0x7f9a08520cf6  main at ??:0
    @     0x7f99f97a9d5d  __libc_start_main at ??:0
    @           0x43e06d  (unknown) at ??:0

Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509
M src/kudu/master/
M src/kudu/master/catalog_manager.h
M src/kudu/master/
M src/kudu/master/
M src/kudu/master/sys_catalog.h
5 files changed, 71 insertions(+), 38 deletions(-)

  git pull ssh:// refs/changes/65/3465/2
To view, visit
To unsubscribe, visit

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <>
Gerrit-Reviewer: Todd Lipcon <>

Reply via email to