Adar Dembo has uploaded a new patch set (#2). Change subject: catalog_manager: prevent spurious dirty callbacks from crashing the process ......................................................................
catalog_manager: prevent spurious dirty callbacks from crashing the process The recent change to switch from LocalConsensus to RaftConsensus has led to an increased number of dirty callback calls when using just one master. Some of these calls occur with no term change; the catalog manager treats them as any other calls and reloads the on-disk metadata. In theory that's just unnecessary work, but CheckIsLeaderAndReady() doesn't provide adequate protection for the rest of the master when in this state, so nearly every RPC is allowed in during this time. That's an absolute disaster for correctness: imagine a GetTableLocations() returning only a subset of a table's tablets because the rest were still being loaded from disk. Luckily for us it can also manifest as a crash [1] so we noticed it quickly. I chose to fix it by ignoring these calls within CatalogManager::VisitTablesAndTabletsTask and not SysCatalogTable::SysCatalogStateChanged because the synchronization in the former is more straight forward thanks to the size of worker_pool_. The new test led to a crash 100% of the time without this fix. There's an argument to be made for changing TableInfo::TabletMap's raw pointers to shared pointers thus avoiding this crash altogether, but it's still an incorrect state to be in, so I don't see the value in doing that. While I was here, I snuck a few other changes in: - Remove a lock acquisition from ElectedAsLeaderCb; it did nothing. - Remove old_role_ from SysCatalogTable; it also did nothing. - Narrow the lock acquisition in VisitTablesAndTabletsTask; it only needs to protect the visiting logic. 1. Sample crash output: F0618 05:47:41.795367 9330 ref_counted.cc:74] Check failed: !in_dtor_ *** Check failure stack trace: *** @ 0x7f99fab05f7d google::LogMessage::Fail() at ??:0 @ 0x7f99fab07e7d google::LogMessage::SendToLog() at ??:0 @ 0x7f99fab05ab9 google::LogMessage::Flush() at ??:0 @ 0x7f99fab0891f google::LogMessageFatal::~LogMessageFatal() at ??:0 @ 0x7f99fae8a637 kudu::subtle::RefCountedThreadSafeBase::AddRef() at ??:0 @ 0x7f9a07c486d8 make_scoped_refptr<>() at ??:0 @ 0x7f9a07c2f7f7 kudu::master::TableInfo::GetTabletsInRange() at ??:0 @ 0x7f9a07c2ec35 kudu::master::CatalogManager::GetTableLocations() at ??:0 @ 0x51962e kudu::master::WaitForRunningTabletCount() at ... @ 0x51ce59 kudu::CreateTableStressTest_RestartMasterDuringCreation_Test::TestBody() at ... @ 0x7f99fc7f5b48 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0 @ 0x7f99fc7ea012 testing::Test::Run() at ??:0 @ 0x7f99fc7ea158 testing::TestInfo::Run() at ??:0 @ 0x7f99fc7ea235 testing::TestCase::Run() at ??:0 @ 0x7f99fc7ea518 testing::internal::UnitTestImpl::RunAllTests() at ??:0 @ 0x7f99fc7f6058 testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0 @ 0x7f99fc7ea7fd testing::UnitTest::Run() at ??:0 @ 0x7f9a08520cf6 main at ??:0 @ 0x7f99f97a9d5d __libc_start_main at ??:0 @ 0x43e06d (unknown) at ??:0 Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509 --- M src/kudu/master/catalog_manager.cc M src/kudu/master/catalog_manager.h M src/kudu/master/master-test.cc M src/kudu/master/sys_catalog.cc M src/kudu/master/sys_catalog.h 5 files changed, 71 insertions(+), 38 deletions(-) git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/65/3465/2 -- To view, visit http://gerrit.cloudera.org:8080/3465 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I5bc602a51a4914ec13f926bfca555c816a2ee509 Gerrit-PatchSet: 2 Gerrit-Project: kudu Gerrit-Branch: master Gerrit-Owner: Adar Dembo <a...@cloudera.com> Gerrit-Reviewer: Kudu Jenkins Gerrit-Reviewer: Mike Percy <mpe...@apache.org> Gerrit-Reviewer: Todd Lipcon <t...@apache.org>