kangkaisen opened a new issue #242: Poor query performance under high concurrency
URL: https://github.com/apache/incubator-doris/issues/242

**Basic Environment:**

- OS version: CentOS Linux release 7.1.1503
- Palo version: 3.3.19
- GCC version: 4.8.5

**Test Environment:**

1. Cluster info: 4 BEs, each a physical machine with 24 CPU cores and 96 GB of memory.
2. Test table: table A is a UNIQUE table, dt is the PARTITION key, and each partition has 5777108 rows.
3. Test SQLs:

SQL1:
```
select count(*) FROM A t1 where t1.dt = 20180909;
```
SQL2:
```
select dt, B, C, D, E, F FROM A t1 where dt in (20180909, 20180910, 20180911) limit 20000;
```
SQL3:
```
select count(*) FROM A t1 left JOIN [shuffle] B t6 ON ((t1.dt = t6.dt) AND (t1.wm_poi_id = t6.wm_poi_id)) where t1.dt = 20180909;
```

**Test Method:**
```
mysqlslap -u xxx --password=xxx -h xxx -P xxx -c 1,2,4,8,16,32 -i 3 --create-schema=test --query="sql3";
```

**Test Result:**

SQL1:
```
concurrency1:  0.118 seconds
concurrency2:  0.125 seconds
concurrency4:  0.154 seconds
concurrency8:  0.228 seconds
concurrency16: 0.445 seconds
concurrency32: 0.904 seconds
```
SQL2:
```
concurrency1:  0.115 seconds
concurrency2:  0.136 seconds
concurrency4:  0.172 seconds
concurrency8:  0.214 seconds
concurrency16: 0.289 seconds
concurrency32: 0.459 seconds
```
SQL3:
```
concurrency1:  0.864 seconds
concurrency2:  1.065 seconds
concurrency4:  1.500 seconds
concurrency8:  2.423 seconds
concurrency16: 4.307 seconds
concurrency32: 8.468 seconds
```

**Expected behavior**

Queries getting slower as concurrency rises is reasonable. But degrading this quickly is not, especially since no system resource (CPU, memory, network, IO) is saturated.

After logging and perf analysis, I think there are three bottlenecks in Doris:

***Bottleneck 1: TCMalloc***
***Bottleneck 2: memory_copy***
***Bottleneck 3: process_probe_batch and process_build_batch for the join SQL***

Some perf results:

SQL1:
```
perf record -e cycle_activity.stalls_total -ags -p `pidof palo_be`
```
```
 9.71%  palo_be  palo_be            [.] palo::memory_copy
 9.12%  palo_be  palo_be            [.] palo::OlapScanner::get_batch
 7.88%  palo_be  palo_be            [.] palo::Reader::_unique_key_next_row
 7.81%  palo_be  palo_be            [.] palo::RowCursor::full_key_cmp
 5.73%  palo_be  palo_be            [.] palo::CollectIterator::next
 4.86%  palo_be  palo_be            [.] palo::VectorizedRowBatch::dump_to_row_block
 4.19%  palo_be  palo_be            [.] std::__push_heap<__gnu_cxx::__normal_iterator<palo::CollectIterator::ChildCtx**, std::v
 3.26%  palo_be  palo_be            [.] palo::EqIntValPred::get_boolean_val
 2.96%  palo_be  palo_be            [.] SpinLock::SpinLoop
 2.50%  palo_be  palo_be            [.] palo::OlapScanner::_convert_row_to_tuple
 2.36%  palo_be  [kernel.kallsyms]  [k] flush_tlb_func
 1.92%  palo_be  palo_be            [.] palo::SlotRef::get_int_val
 1.59%  palo_be  palo_be            [.] std::__adjust_heap<__gnu_cxx::__normal_iterator<palo::CollectIterator::ChildCtx**, std:
 1.36%  palo_be  [kernel.kallsyms]  [k] clear_page_c_e
 1.26%  palo_be  [kernel.kallsyms]  [k] _raw_spin_lock
```

The annotation for palo::memory_copy:
```
 1.07 │   pop %rbp
      │ template<> inline void fixed_size_memory_copy<2>(void* dst, const void* src) {
      │     *(reinterpret_cast<uint16_t*>(dst)) = *(reinterpret_cast<const uint16_t*>(src));
      │ }
      │
      │ template<> inline void fixed_size_memory_copy<4>(void* dst, const void* src) {
      │     *(reinterpret_cast<uint32_t*>(dst)) = *(reinterpret_cast<const uint32_t*>(src));
32.73 │   mov %eax,(%rdi)
      │     return fixed_size_memory_copy<255>(dst, src);
      │ }
      │
      │     memcpy(dst, src, size);
      │     return;
      │ }
36.66 │ ← retq
      │ inline void fixed_size_memory_copy(void* dst, const void* src) {
      │     struct X {
```

The annotation for VectorizedRowBatch::dump_to_row_block:
```
 4.61 │1e0:  lea 0x1(%rbx),%rdi
 4.98 │      mov %r15,%rsi
      │     TupleRow* row = row_batch->get_row(row_index++);
      │     row->set_tuple(0, tuple);
      │     tuple = reinterpret_cast<Tuple*>(reinterpret_cast<uint8_t*>(tuple) +
      │         tuple_desc.byte_size());
      │ }
 3.56 │      movb $0x0,(%rbx)
      │ } else {
33.67 │      mov %r12,%rdx
 4.93 │      mov %r8d,-0x38(%rbp)
      │     for (int i = _row_iter; i < _row_iter + size; ++i) {
      │         for (int j = 0; j < slots.size(); ++j) {
12.10 │      add %r12,%r15
      │     TupleRow* row = row_batch->get_row(row_index++);
      │     row->set_tuple(0, tuple);
      │     tuple = reinterpret_cast<Tuple*>(reinterpret_cast<uint8_t*>(tuple) +
      │         tuple_desc.byte_size());
      │ }
      │ } else {
```

SQL2:
```
10.20%  palo_be  palo_be            [.] SpinLock::SpinLoop
 4.67%  palo_be  [kernel.kallsyms]  [k] clear_page_c_e
 4.04%  palo_be  palo_be            [.] palo::memory_copy
 3.93%  palo_be  [kernel.kallsyms]  [k] _raw_spin_lock
 3.27%  palo_be  palo_be            [.] palo::VectorizedRowBatch::dump_to_row_block
 3.10%  palo_be  [kernel.kallsyms]  [k] flush_tlb_func
 2.73%  palo_be  palo_be            [.] tcmalloc::CentralFreeList::FetchFromOneSpans
 2.52%  palo_be  palo_be            [.] tc_deletearray_nothrow
 1.96%  palo_be  palo_be            [.] tcmalloc::CentralFreeList::ReleaseToSpans
 1.78%  palo_be  palo_be            [.] operator new[]
 1.70%  palo_be  palo_be            [.] tcmalloc::ThreadCache::ReleaseToCentralCache
```
```
- 11.71% 10.21%  palo_be  palo_be  [.] SpinLock::SpinLoop
   - 9.59% thread_proxy
      - 9.20% palo::PriorityThreadPool::work_thread
         - 9.19% palo::OlapScanNode::scanner_thread
            - 4.60% palo::OlapScanner::close
                 palo::Reader::~Reader
               - palo::Reader::close
                  - 4.58% palo::OLAPTable::release_data_sources
                     - 4.58% palo::column_file::ColumnData::~ColumnData
                        - palo::column_file::ColumnData::~ColumnData
                           - 4.33% palo::column_file::SegmentReader::~SegmentReader
                              - 3.82% boost::detail::sp_counted_base::release
                                 - palo::column_file::ByteBuffer::BufDeleter::operator()
                                    + 2.48% tcmalloc::ThreadCache::Scavenge
                                    + 1.33% tcmalloc::ThreadCache::IncreaseCacheLimit
            - 3.73% palo::OlapScanner::open
               - palo::Reader::init
                  - 3.41% palo::Reader::_attach_data_to_merge_set
                     - 3.38% palo::column_file::ColumnData::prepare_block_read
                        - palo::column_file::ColumnData::_seek_to_row
                           - 3.06% palo::column_file::ColumnData::_seek_to_block
                              - 2.87% palo::column_file::SegmentReader::seek_to_block
                                 - 2.50% palo::column_file::SegmentReader::_read_all_data_streams
                                    - 2.45% palo::column_file::ByteBuffer::create
                                         tcmalloc::ThreadCache::FetchFromCentralCache
                                       + tcmalloc::CentralFreeList::RemoveRange
              0.69% palo::OlapScanner::get_batch
```

SQL3:
```
13.89%  palo_be  palo_be            [.] palo::HashJoinNode::process_probe_batch
 7.11%  palo_be  palo_be            [.] palo::HashJoinNode::process_build_batch
 3.89%  palo_be  palo_be            [.] palo::memory_copy
 3.60%  palo_be  palo_be            [.] palo::ExprContext::get_value
 3.31%  palo_be  palo_be            [.] SpinLock::SpinLoop
 3.15%  palo_be  palo_be            [.] palo::HashTable::resize_buckets
 2.48%  palo_be  palo_be            [.] tcmalloc::CentralFreeList::FetchFromOneSpans
 2.25%  palo_be  libc-2.17.so       [.] __memcpy_ssse3_back
 2.05%  palo_be  [kernel.kallsyms]  [k] flush_tlb_func
 1.91%  palo_be  palo_be            [.] palo::RowCursor::full_key_cmp
 1.68%  palo_be  [kernel.kallsyms]  [k] clear_page_c_e
 1.42%  palo_be  palo_be            [.] palo::VectorizedRowBatch::dump_to_row_block
 1.32%  palo_be  palo_be            [.] tcmalloc::CentralFreeList::ReleaseToSpans
 1.25%  palo_be  palo_be            [.] palo::DataStreamSender::send
 1.15%  palo_be  palo_be            [.] tcmalloc::ThreadCache::ReleaseToCentralCache
```

**Discussion**

For the TCMalloc bottleneck:

1. We could reduce the number of calls to ByteBuffer::create.
2. Some Impala work may be worth referring to:
   - IMPALA-5481: RowDescriptors should be shared, rather than copied
   - IMPALA-5518: Allocate KrpcDataStreamRecvr RowBatch tuples from BufferPool
   - IMPALA-6425: Change MemPool memory allocation size to be < 1 MB to avoid allocating from CentralFreeList
   - IMPALA-4923: Operators running on top of selective Parquet scans spend a lot of time calling impala::MemPool::FreeAll on empty batches

For the memory_copy bottleneck: I think the current memcpy implementation is efficient; instead, we should reduce the number of calls to palo::memory_copy.

**Question**

Is TCMalloc the best malloc for Doris? Do we need to try jemalloc?

If I am wrong, please correct me. Any comments are welcome!
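To make point 1 of the TCMalloc discussion concrete, here is a minimal sketch of the buffer-reuse idea: a per-thread free list that recycles fixed-size buffers so that repeated ByteBuffer::create calls stop round-tripping through `new[]`/`delete[]` and therefore through TCMalloc's CentralFreeList. All names (`BufferPool`, `acquire`, `release`) are hypothetical illustrations, not the actual Doris API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: recycle fixed-size buffers within one scanner
// thread instead of allocating/freeing on every segment read.
class BufferPool {
public:
    explicit BufferPool(size_t buf_size) : _buf_size(buf_size) {}

    ~BufferPool() {
        for (uint8_t* buf : _free) delete[] buf;
    }

    uint8_t* acquire() {
        if (!_free.empty()) {
            uint8_t* buf = _free.back();  // fast path: reuse, no allocator call
            _free.pop_back();
            return buf;
        }
        return new uint8_t[_buf_size];    // slow path: allocate once
    }

    void release(uint8_t* buf) {
        _free.push_back(buf);             // keep the buffer for the next read
    }

private:
    size_t _buf_size;
    std::vector<uint8_t*> _free;
};

// One pool per thread would avoid exactly the cross-thread transfer
// traffic that shows up as CentralFreeList::FetchFromOneSpans and
// ThreadCache::ReleaseToCentralCache in the profiles above.
thread_local BufferPool g_segment_buf_pool(64 * 1024);
```

Under this scheme, the steady-state cost of a scan is a vector push/pop per buffer rather than an allocator call, at the price of holding the high-water-mark amount of memory per thread.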
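For the memory_copy point, the idea of "reduce the number of calls" can be sketched as follows: when source and destination share the same fixed-length layout, one bulk copy per row replaces one small copy per slot. The layout struct and function names here are invented for illustration; the real hot path is in _convert_row_to_tuple and dump_to_row_block.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical description of a fixed-length row layout.
struct RowLayout {
    size_t row_bytes;  // total fixed-length bytes per row
};

// Per-slot copying: one memory_copy call per column per row.
void copy_row_per_slot(uint8_t* dst, const uint8_t* src,
                       const size_t* slot_offsets, const size_t* slot_sizes,
                       int num_slots) {
    for (int i = 0; i < num_slots; ++i) {
        std::memcpy(dst + slot_offsets[i], src + slot_offsets[i], slot_sizes[i]);
    }
}

// Bulk copying: one call per row, amortizing per-call overhead, which is
// where the annotation above shows the cycles going (mov + retq, i.e.
// the copy bodies are tiny and the call volume dominates).
void copy_row_whole(uint8_t* dst, const uint8_t* src, const RowLayout& layout) {
    std::memcpy(dst, src, layout.row_bytes);
}
```

This only works when the tuple layout is a plain prefix of the storage row (no nullable indirection or variable-length slots in between), so it is a candidate optimization, not a drop-in replacement.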