[
https://issues.apache.org/jira/browse/KUDU-1865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16781149#comment-16781149
]
Will Berkeley commented on KUDU-1865:
-------------------------------------
Here's some of the relevant stacks:
This thread is holding the lock for the central cache:
{noformat}
tids=[5700]
0x379ba0f710 <unknown>
0x9bc86d tcmalloc::ThreadCache::ReleaseToCentralCache()
0x9bcb4f tcmalloc::ThreadCache::Scavenge()
0xa86534
_ZNSt6vectorISt4pairIPN4kudu6tablet6RowSetEiESaIS5_EE19_M_emplace_back_auxIJRS4_RiEEEvDpOT_
0xa742cd
_ZNSt17_Function_handlerIFvPN4kudu6tablet6RowSetEiEZNS1_6Tablet17BulkCheckPresenceEPKNS0_2fs9IOContextEPNS1_21WriteTransactionStateEEUlS3_iE2_E9_M_invokeERKSt9_Any_dataS3_i
0xaee074
_ZNK4kudu22interval_tree_internal6ITNodeINS_6tablet20RowSetIntervalTraitsEE31ForEachIntervalContainingPointsIZNKS2_10RowSetTree27ForEachRowSetContainingKeysERKSt6vectorINS_5SliceESa
IS8_EERKSt8functionIFvPNS2_6RowSetEiEEEUlRKNS2_12_GLOBAL__N_111QueryStructEPNS2_16RowSetWithBoundsEE_N9__gnu_cxx17__normal_iteratorIPSM_S7_ISL_SaISL_EEEEEEvT0_SX_RKT_
0xaedd03
_ZNK4kudu22interval_tree_internal6ITNodeINS_6tablet20RowSetIntervalTraitsEE31ForEachIntervalContainingPointsIZNKS2_10RowSetTree27ForEachRowSetContainingKeysERKSt6vectorINS_5SliceESa
IS8_EERKSt8functionIFvPNS2_6RowSetEiEEEUlRKNS2_12_GLOBAL__N_111QueryStructEPNS2_16RowSetWithBoundsEE_N9__gnu_cxx17__normal_iteratorIPSM_S7_ISL_SaISL_EEEEEEvT0_SX_RKT_
0xaedd03
_ZNK4kudu22interval_tree_internal6ITNodeINS_6tablet20RowSetIntervalTraitsEE31ForEachIntervalContainingPointsIZNKS2_10RowSetTree27ForEachRowSetContainingKeysERKSt6vectorINS_5SliceESa
IS8_EERKSt8functionIFvPNS2_6RowSetEiEEEUlRKNS2_12_GLOBAL__N_111QueryStructEPNS2_16RowSetWithBoundsEE_N9__gnu_cxx17__normal_iteratorIPSM_S7_ISL_SaISL_EEEEEEvT0_SX_RKT_
0xaee1b3
_ZNK4kudu22interval_tree_internal6ITNodeINS_6tablet20RowSetIntervalTraitsEE31ForEachIntervalContainingPointsIZNKS2_10RowSetTree27ForEachRowSetContainingKeysERKSt6vectorINS_5SliceESa
IS8_EERKSt8functionIFvPNS2_6RowSetEiEEEUlRKNS2_12_GLOBAL__N_111QueryStructEPNS2_16RowSetWithBoundsEE_N9__gnu_cxx17__normal_iteratorIPSM_S7_ISL_SaISL_EEEEEEvT0_SX_RKT_
0xaee3a3 kudu::tablet::RowSetTree::ForEachRowSetContainingKeys()
0xa80c17 kudu::tablet::Tablet::BulkCheckPresence()
0xa8108a kudu::tablet::Tablet::ApplyRowOperations()
0xab4e7a kudu::tablet::WriteTransaction::Apply()
0xaad2b5 kudu::tablet::TransactionDriver::ApplyTask()
0x1fa32dd kudu::ThreadPool::DispatchThread()
0x1f9bd91 kudu::Thread::SuperviseThread()
{noformat}
Here's a bunch of threads backed up waiting on it:
{noformat}
tids=[1391]
0x379ba0f710 <unknown>
0x9c9dad tcmalloc::internal::SpinLockDelay()
0x9c1929 SpinLock::SlowLock()
0x9c1ee8 tcmalloc::CentralFreeList::InsertRange()
0x9bc834 tcmalloc::ThreadCache::ReleaseToCentralCache()
0x9bcbd5 tcmalloc::ThreadCache::ListTooLong()
0x1e6dbd8
google::protobuf::internal::RepeatedPtrFieldBase::Destroy<>()
0x1e6dcae kudu::tablet::TxResultPB::~TxResultPB()
0x1db60da kudu::consensus::CommitMsg::~CommitMsg()
0xbc9feb
google::protobuf::internal::RepeatedPtrFieldBase::Destroy<>()
0xbca0de kudu::log::LogEntryBatchPB::~LogEntryBatchPB()
0xbb300c kudu::log::Log::AppendThread::HandleGroup()
0xbb395e kudu::log::Log::AppendThread::DoWork()
0x1fa32dd kudu::ThreadPool::DispatchThread()
0x1f9bd91 kudu::Thread::SuperviseThread()
0x379ba079d1 start_thread
tids=[5695]
0x379ba0f710 <unknown>
0x9c9dad tcmalloc::internal::SpinLockDelay()
0x9c1929 SpinLock::SlowLock()
0x9c2391 tcmalloc::CentralFreeList::RemoveRange()
0x9bc675 tcmalloc::ThreadCache::FetchFromCentralCache()
0x1ec4f81 std::vector<>::_M_emplace_back_aux<>()
0x1ec4b27 kudu::RowOperationsPBDecoder::DecodeOperations()
0xa78388 kudu::tablet::Tablet::DecodeWriteOperations()
0xab3d39 kudu::tablet::WriteTransaction::Prepare()
0xaaae45 kudu::tablet::TransactionDriver::Prepare()
0xaabbdd kudu::tablet::TransactionDriver::PrepareTask()
0x1fa32dd kudu::ThreadPool::DispatchThread()
0x1f9bd91 kudu::Thread::SuperviseThread()
0x379ba079d1 start_thread
0x379b6e88fd clone
tids=[5661]
0x379ba0f710 <unknown>
0x9c9e07 tcmalloc::internal::SpinLockWake()
0x9c2373 tcmalloc::CentralFreeList::RemoveRange()
0x9bc675 tcmalloc::ThreadCache::FetchFromCentralCache()
0x1e94853 kudu::EncodedKeyBuilder::BuildEncodedKey()
0x1e9535c kudu::EncodedKey::FromContiguousRow()
0xa6b77e kudu::tablet::Tablet::AcquireLockForOp()
0xa6baca kudu::tablet::Tablet::AcquireRowLocks()
0xab3f11 kudu::tablet::WriteTransaction::Prepare()
0xaaae45 kudu::tablet::TransactionDriver::Prepare()
0xaabbdd kudu::tablet::TransactionDriver::PrepareTask()
0x1fa32dd kudu::ThreadPool::DispatchThread()
0x1f9bd91 kudu::Thread::SuperviseThread()
0x379ba079d1 start_thread
0x379b6e88fd clone
{noformat}
> Create fast path for RespondSuccess() in KRPC
> ---------------------------------------------
>
> Key: KUDU-1865
> URL: https://issues.apache.org/jira/browse/KUDU-1865
> Project: Kudu
> Issue Type: Improvement
> Components: rpc
> Reporter: Sailesh Mukil
> Priority: Major
> Labels: perfomance, rpc
> Attachments: alloc-pattern.py, cross-thread.txt
>
>
> A lot of RPCs just respond with RespondSuccess() which returns the exact
> payload every time. This takes the same path as any other response by
> ultimately calling Connection::QueueResponseForCall() which has a few small
> allocations. These small allocations (and their corresponding deallocations)
> are called quite frequently (once for every IncomingCall) and end up taking
> quite some time in the kernel (traversing the free list, spin locks etc.)
> This was found when [~mmokhtar] ran some profiles on Impala over KRPC on a 20
> node cluster and found the following:
> The exact % of time spent is hard to quantify from the profiles, but these
> were the among the top 5 of the slowest stacks:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown
> source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source
> file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown
> source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! __gnu_cxx::new_allocator<kudu::Slice>::deallocate + 0x4 -
> new_allocator.h:110
> impalad ! std::_Vector_base<kudu::Slice,
> std::allocator<kudu::Slice>>::_M_deallocate + 0x5 - stl_vector.h:178
> impalad ! ~_Vector_base + 0x4 - stl_vector.h:160
> impalad ! ~vector - stl_vector.h:425 <----Deleting
> 'slices' vector
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0xac -
> connection.cc:433
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ..
> {code}
> {code:java}
> impalad ! tcmalloc::CentralFreeList::FetchFromOneSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::RemoveRange + 0xc0 - [unknown source
> file]
> impalad ! tcmalloc::ThreadCache::FetchFromCentralCache + 0x62 - [unknown
> source file]
> impalad ! operator new + 0x297 - [unknown source file] <--- Creating
> new 'OutboundTransferTask' object.
> impalad ! kudu::rpc::Connection::QueueResponseForCall + 0x76 -
> connection.cc:432
> impalad ! kudu::rpc::InboundCall::Respond + 0xfa - inbound_call.cc:133
> impalad ! kudu::rpc::InboundCall::RespondSuccess + 0x43 - inbound_call.cc:77
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1f7 - rpc_context.cc:66
> ...
> {code}
> Even creating and deleting the 'RpcContext' takes a lot of time:
> {code:java}
> impalad ! tcmalloc::CentralFreeList::ReleaseToSpans - [unknown source file]
> impalad ! tcmalloc::CentralFreeList::ReleaseListToSpans + 0x1a - [unknown
> source file]
> impalad ! tcmalloc::CentralFreeList::InsertRange + 0x3b - [unknown source
> file]
> impalad ! tcmalloc::ThreadCache::ReleaseToCentralCache + 0x103 - [unknown
> source file]
> impalad ! tcmalloc::ThreadCache::Scavenge + 0x3e - [unknown source file]
> impalad ! operator delete + 0x329 - [unknown source file]
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x16 -
> impala_internal_service.pb.cc:1221
> impalad ! impala::TransmitDataResponsePb::~TransmitDataResponsePb + 0x8 -
> impala_internal_service.pb.cc:1222
> impalad ! kudu::DefaultDeleter<google::protobuf::Message>::operator() + 0x5 -
> gscoped_ptr.h:145
> impalad ! ~gscoped_ptr_impl + 0x9 - gscoped_ptr.h:228
> impalad ! ~gscoped_ptr - gscoped_ptr.h:318
> impalad ! kudu::rpc::RpcContext::~RpcContext + 0x1e - rpc_context.cc:53
> <-----
> impalad ! kudu::rpc::RpcContext::RespondSuccess + 0x1ff - rpc_context.cc:67
> {code}
> The above show that creating these small objects under moderately heavy load
> results in heavy contention in the kernel. We will benefit a lot if we create
> a fast path for 'RespondSuccess'.
> My suggestion is to create all these small objects at once along with the
> 'InboundCall' object while it is being created, in a 'RespondSuccess'
> structure, and just use that structure when we want to send 'success' back to
> the sender. This would already contain the 'OutboundTransferTask', a 'Slice'
> with 'success', etc. We would expect that most RPCs respond with 'success' a
> majority of the time.
> How this would benefit us is that we don't go back and forth every time to
> allocate and deallocate memory for these small objects, instead we do it all
> at once while creating the 'InboundCall' object.
> I just wanted to start a discussion about this, so even if what I suggested
> seems a little off, hopefully we can move forward with this on some level.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)