[
https://issues.apache.org/jira/browse/KUDU-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Grant Henke updated KUDU-1520:
------------------------------
Target Version/s: (was: 1.5.0)
> Possible race between alter schema lock release and tablet shutdown
> -------------------------------------------------------------------
>
> Key: KUDU-1520
> URL: https://issues.apache.org/jira/browse/KUDU-1520
> Project: Kudu
> Issue Type: Bug
> Components: tablet
> Affects Versions: 0.9.1
> Reporter: Adar Dembo
> Priority: Major
>
> I've been running a new stress that hammers a cluster with concurrent alter
> and delete table requests, and one of my test runs failed with the following:
> {noformat}
> F0707 18:59:34.311122 373 rw_semaphore.h:145] Check failed:
> base::subtle::NoBarrier_Load(&state_) == kWriteFlag (0 vs. 2147483648)
> *** Check failure stack trace: ***
> @ 0x7f86cd37df5d google::LogMessage::Fail() at ??:0
> @ 0x7f86cd37fe5d google::LogMessage::SendToLog() at ??:0
> @ 0x7f86cd37da99 google::LogMessage::Flush() at ??:0
> @ 0x7f86cd3808ff google::LogMessageFatal::~LogMessageFatal() at ??:0
> @ 0x7f86d4f77c78 kudu::rw_semaphore::unlock() at ??:0
> @ 0x7f86d3728de0 std::unique_lock<>::unlock() at ??:0
> @ 0x7f86d3727192 std::unique_lock<>::~unique_lock() at ??:0
> @ 0x7f86d3725582
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at
> ??:0
> @ 0x7f86d37255be
> kudu::tablet::AlterSchemaTransactionState::~AlterSchemaTransactionState() at
> ??:0
> @ 0x7f86d4f68dce std::default_delete<>::operator()() at ??:0
> @ 0x7f86d4f670b9 std::unique_ptr<>::~unique_ptr() at ??:0
> @ 0x7f86d374510e
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d374514a
> kudu::tablet::AlterSchemaTransaction::~AlterSchemaTransaction() at ??:0
> @ 0x7f86d373f532 kudu::DefaultDeleter<>::operator()() at ??:0
> @ 0x7f86d373df4a
> kudu::internal::gscoped_ptr_impl<>::~gscoped_ptr_impl() at ??:0
> @ 0x7f86d373d552 gscoped_ptr<>::~gscoped_ptr() at ??:0
> @ 0x7f86d373d580
> kudu::tablet::TransactionDriver::~TransactionDriver() at ??:0
> @ 0x7f86d3740ab4 kudu::RefCountedThreadSafe<>::DeleteInternal() at
> ??:0
> @ 0x7f86d3740405
> kudu::DefaultRefCountedThreadSafeTraits<>::Destruct() at ??:0
> @ 0x7f86d373f928 kudu::RefCountedThreadSafe<>::Release() at ??:0
> @ 0x7f86d373e769 scoped_refptr<>::~scoped_refptr() at ??:0
> @ 0x7f86d37397cd kudu::tablet::TabletPeer::SubmitAlterSchema() at
> ??:0
> @ 0x7f86d4f4e070
> kudu::tserver::TabletServiceAdminImpl::AlterSchema() at ??:0
> @ 0x7f86d27a4e92
> _ZZN4kudu7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS_12MetricEntityEEENKUlPKN6google8protobuf7MessageEPS9_PNS_3rpc10RpcContextEE1_clESB_SC_SF_
> at ??:0
> @ 0x7f86d27a5d96
> _ZNSt17_Function_handlerIFvPKN6google8protobuf7MessageEPS2_PN4kudu3rpc10RpcContextEEZNS6_7tserver26TabletServerAdminServiceIfC4ERK13scoped_refptrINS6_12MetricEntityEEEUlS4_S5_S9_E1_E9_M_invokeERKSt9_Any_dataS4_S5_S9_
> at ??:0
> @ 0x7f86d22ce6e4 std::function<>::operator()() at ??:0
> @ 0x7f86d22ce19b kudu::rpc::GeneratedServiceIf::Handle() at ??:0
> @ 0x7f86d22d0a97 kudu::rpc::ServicePool::RunThread() at ??:0
> @ 0x7f86d22d1d45 boost::_mfi::mf0<>::operator()() at ??:0
> @ 0x7f86d22d1b6c boost::_bi::list1<>::operator()<>() at ??:0
> @ 0x7f86d22d1a61 boost::_bi::bind_t<>::operator()() at ??:0
> @ 0x7f86d22d1998
> boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
> {noformat}
> After looking through the code a bit, I suspect this happened because, in the
> event of failure, the AlterSchema transaction releases the tablet's schema
> lock implicitly (i.e. when AlterSchemaTransactionState is destroyed) _after_
> the transaction itself is removed from the driver's TransactionTracker. Thus,
> the WaitForAllToFinish() performed during the tablet shutdown process thinks
> all the transactions are done and proceeds to free tablet state. Later, the
> last reference to the transaction is released (in
> TabletPeer::SubmitAlterSchema), the transaction is destroyed, and we try to
> unlock a lock whose memory has already been freed.
> If this analysis is correct, the broken invariant is: once the transaction
> has been released from the tracker, it may no longer access any tablet state.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)