Adar Dembo created KUDU-1520:
--------------------------------
Summary: Possible race between alter schema lock release and tablet shutdown
Key: KUDU-1520
URL: https://issues.apache.org/jira/browse/KUDU-1520
Project: Kudu
Issue Type: Bug
Components: tablet
Affects Versions: 0.9.1
Reporter: Adar Dembo
Priority: Critical
I've been running a new stress test that hammers a cluster with concurrent alter
and delete table requests, and one of my test runs failed with the following:
{noformat}
F0707 18:59:34.311122 373 rw_semaphore.h:145] Check failed: base::subtle::NoBarrier_Load(&state_) == kWriteFlag (0 vs. 2147483648)
*** Check failure stack trace: ***
@ 0x7f86cd37df5d google::LogMessage::Fail() at ??:0
@ 0x7f86cd37fe5d google::LogMessage::SendToLog() at ??:0
@ 0x7f86cd37da99 google::LogMessage::Flush() at ??:0
@ 0x7f86cd3808ff google::LogMessageFatal::~LogMessageFatal() at ??:0
@ 0x7f86d4f77c78 kudu::rw_semaphore::unlock() at ??:0
@ 0x7f86d3728de0 std::unique_lock<>::unlock() at ??:0
@ 0x7f86d3727192 std::unique_lock<>::~unique_lock() at ??:0
@ 0x7f86d3725582 kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
@ 0x7f86d37255be kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at ??:0
@ 0x7f86d4f68dce std::default_delete<>::operator()() at ??:0
@ 0x7f86d4f670b9 std::unique_ptr<>::~unique_ptr() at ??:0
@ 0x7f86d374510e kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
@ 0x7f86d374514a kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
@ 0x7f86d373f532 kudu::DefaultDeleter<>::operator()() at ??:0
@ 0x7f86d373df4a kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
@ 0x7f86d373d552 gscoped_ptr<>::~gscoped_ptr() at ??:0
@ 0x7f86d373d580 kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
@ 0x7f86d3740ab4 kudu::RefCountedThreadSafe<>::DeleteInternal() at ??:0
@ 0x7f86d3740405 kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
@ 0x7f86d373f928 kudu::RefCountedThreadSafe<>::Release() at ??:0
@ 0x7f86d373e769 scoped_refptr<>::~scoped_refptr() at ??:0
@ 0x7f86d37397cd kudu::tablet::TabletPeer::SubmitAlterSchema() at ??:0
@ 0x7f86d4f4e070 kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
@ 0x7f86d27a4e92 _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_ at ??:0
@ 0x7f86d27a5d96 _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_ at ??:0
@ 0x7f86d22ce6e4 std::function<>::operator()() at ??:0
@ 0x7f86d22ce19b kudu::rpc::GeneratedServiceIf::Handle() at ??:0
@ 0x7f86d22d0a97 kudu::rpc::ServicePool::RunThread() at ??:0
@ 0x7f86d22d1d45 boost::_mfi::mf0<>::operator()() at ??:0
@ 0x7f86d22d1b6c boost::_bi::list1<>::operator()<>() at ??:0
@ 0x7f86d22d1a61 boost::_bi::bind_t<>::operator()() at ??:0
@ 0x7f86d22d1998 boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
{noformat}
After looking through the code a bit, I suspect this happened because, in the
event of failure, the AlterSchema transaction releases the tablet's schema lock
implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_ the
transaction itself is removed from the driver's TransactionTracker. Thus, the
WaitForAllToFinish() performed during the tablet shutdown process thinks all
the transactions are done and proceeds to free tablet state. Later, the last
reference to the transaction is released (in TabletPeer::SubmitAlterSchema),
the transaction is destroyed, and we try to unlock a lock whose memory has
already been freed.
If this analysis is correct, the broken invariant is: once the transaction has
been released from the tracker, it may no longer access any tablet state.
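One way to restore that invariant would be to release the schema lock explicitly before the transaction leaves the tracker, so that the later destructor has nothing left to unlock. The following is only a minimal single-threaded sketch of that ordering, not Kudu code: ToyTablet, ToyTracker, ToyAlterState, FinishTransaction and RunScenario are all hypothetical stand-ins, and std::mutex stands in for the tablet's rw_semaphore.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the tablet: it owns the schema lock.
struct ToyTablet {
  std::mutex schema_lock;
};

// Hypothetical stand-in for TransactionTracker: counts in-flight transactions.
class ToyTracker {
 public:
  void Add() { std::lock_guard<std::mutex> l(mu_); ++in_flight_; }
  void Release() { std::lock_guard<std::mutex> l(mu_); --in_flight_; }
  bool AllFinished() { std::lock_guard<std::mutex> l(mu_); return in_flight_ == 0; }

 private:
  std::mutex mu_;
  int in_flight_ = 0;
};

// Hypothetical stand-in for AlterSchemaTransactionState: holds the schema
// lock from construction until it is released explicitly or destroyed.
class ToyAlterState {
 public:
  explicit ToyAlterState(ToyTablet* tablet) : lock_(tablet->schema_lock) {}

  // Explicit release; afterwards the destructor no longer touches the mutex.
  void ReleaseSchemaLock() {
    if (lock_.owns_lock()) lock_.unlock();
  }
  bool HoldsSchemaLock() const { return lock_.owns_lock(); }

 private:
  std::unique_lock<std::mutex> lock_;
};

// Assumed ordering: give up all tablet state first, then leave the tracker.
inline void FinishTransaction(ToyAlterState* state, ToyTracker* tracker) {
  state->ReleaseSchemaLock();
  tracker->Release();
}

// Replays the failure path in order: the transaction finishes, shutdown sees
// an empty tracker and frees the tablet, and only then is the state destroyed.
inline bool RunScenario() {
  auto tablet = std::make_unique<ToyTablet>();
  ToyTracker tracker;
  tracker.Add();
  ToyAlterState state(tablet.get());

  FinishTransaction(&state, &tracker);
  bool safe = tracker.AllFinished() && !state.HoldsSchemaLock();

  tablet.reset();  // "shutdown" frees the tablet, including the mutex
  return safe;     // state's destructor runs here and unlocks nothing
}
```

With the two calls in FinishTransaction swapped, the tracker would report that everything had finished while the state still held the lock, which is the window described above.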