Mike Percy has submitted this change and it was merged.

Change subject: catalog_manager: prevent spurious dirty callbacks from crashing 
the process

catalog_manager: prevent spurious dirty callbacks from crashing the process

The recent change to switch from LocalConsensus to RaftConsensus has led to
an increased number of dirty callback calls when using just one master. Some
of these calls occur with no term change; the catalog manager treats them as
any other calls and reloads the on-disk metadata.

In theory that's just unnecessary work, but CheckIsLeaderAndReady() doesn't
provide adequate protection for the rest of the master when in this state, so
nearly every RPC is allowed in during this time. That's an absolute disaster
for correctness: imagine a GetTableLocations() returning only a subset of a
table's tablets because the rest were still being loaded from disk.

Luckily for us it can also manifest as a crash [1] so we noticed it quickly.
I chose to fix it by ignoring these calls within
CatalogManager::VisitTablesAndTabletsTask and not
SysCatalogTable::SysCatalogStateChanged because the synchronization in the
former is more straight forward thanks to the size of worker_pool_. The new
test led to a crash 100% of the time without this fix.

There's an argument to be made for changing TableInfo::TabletMap's raw
pointers to shared pointers thus avoiding this crash altogether, but it's
still an incorrect state to be in, so I don't see the value in doing that.

While I was here, I snuck a few other changes in:
- Remove a lock acquisition from ElectedAsLeaderCb; it did nothing.
- Remove old_role_ from SysCatalogTable; it also did nothing.
- Narrow the lock acquisition in VisitTablesAndTabletsTask; it only needs
  to protect the visiting logic.

1. Sample crash output:

F0618 05:47:41.795367  9330 ref_counted.cc:74] Check failed: !in_dtor_
*** Check failure stack trace: ***
    @     0x7f99fab05f7d  google::LogMessage::Fail() at ??:0
    @     0x7f99fab07e7d  google::LogMessage::SendToLog() at ??:0
    @     0x7f99fab05ab9  google::LogMessage::Flush() at ??:0
    @     0x7f99fab0891f  google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7f99fae8a637  kudu::subtle::RefCountedThreadSafeBase::AddRef() at 
    @     0x7f9a07c486d8  make_scoped_refptr<>() at ??:0
    @     0x7f9a07c2f7f7  kudu::master::TableInfo::GetTabletsInRange() at ??:0
    @     0x7f9a07c2ec35  kudu::master::CatalogManager::GetTableLocations() at 
    @           0x51962e  kudu::master::WaitForRunningTabletCount() at ...
    @           0x51ce59  
kudu::CreateTableStressTest_RestartMasterDuringCreation_Test::TestBody() at ...
    @     0x7f99fc7f5b48  
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f99fc7ea012  testing::Test::Run() at ??:0
    @     0x7f99fc7ea158  testing::TestInfo::Run() at ??:0
    @     0x7f99fc7ea235  testing::TestCase::Run() at ??:0
    @     0x7f99fc7ea518  testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7f99fc7f6058  
testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7f99fc7ea7fd  testing::UnitTest::Run() at ??:0
    @     0x7f9a08520cf6  main at ??:0
    @     0x7f99f97a9d5d  __libc_start_main at ??:0
    @           0x43e06d  (unknown) at ??:0

Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509
Reviewed-on: http://gerrit.cloudera.org:8080/3465
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <t...@apache.org>
Reviewed-by: Mike Percy <mpe...@apache.org>
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/master-test.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/master/sys_catalog.h
5 files changed, 82 insertions(+), 48 deletions(-)

  Mike Percy: Looks good to me, approved
  Todd Lipcon: Looks good to me, but someone else must approve
  Kudu Jenkins: Verified

To view, visit http://gerrit.cloudera.org:8080/3465
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mpe...@apache.org>
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>

Reply via email to