kangkaisen opened a new issue #242: Poor query performance under high concurrency
URL: https://github.com/apache/incubator-doris/issues/242
 
 
   **Basic Environment:**
   
   - OS version: CentOS Linux release 7.1.1503
   - Palo version: 3.3.19
   - GCC version: 4.8.5
   
   
   **Test Environment:**
   
   1 Cluster Info:
    BE nums: 4
    BE: physical machine, 24 CPU, 96 MEM
   
   2 Test Table: 
   Table A is a UNIQUE table; dt is the PARTITION key,
   and each PARTITION has 5777108 rows.
   
   3 Test SQL:
   
   SQL1: 
   ```
   select count(*) FROM A t1 where t1.dt = 20180909;
   ```
   
   SQL2: 
   ```
   select dt, B, C, D, E, F 
   FROM A t1 
   where dt in (20180909,20180910,20180911) 
   limit 20000;
   ```
   
   SQL3:
   ```
   select count(*) FROM A t1 
   left JOIN [shuffle] B t6 
   ON ((t1.dt = t6.dt) AND (t1.wm_poi_id = t6.wm_poi_id)) 
   where t1.dt = 20180909;
   ```
   
   Test Method:
   
   ```
   mysqlslap -u xxx --password=xxx -h xxx -P xxx -c 1,2,4,8,16,32 -i 3 --create-schema=test --query="sql3";
   ```
   
   Test Result:
   
   SQL1:
   
   ```
   concurrency1: 0.118 seconds
   concurrency2: 0.125 seconds
   concurrency4: 0.154 seconds
   concurrency8: 0.228 seconds
   concurrency16: 0.445 seconds
   concurrency32: 0.904 seconds
   ```
   
   SQL2:
   
   ```
   concurrency1: 0.115 seconds
   concurrency2: 0.136 seconds
   concurrency4: 0.172 seconds
   concurrency8: 0.214 seconds
   concurrency16: 0.289 seconds
   concurrency32: 0.459 seconds
   ```
   
   SQL3:
   ```
   concurrency1: 0.864 seconds
   concurrency2: 1.065 seconds
   concurrency4: 1.500 seconds
   concurrency8: 2.423 seconds
   concurrency16: 4.307 seconds
   concurrency32: 8.468 seconds
   ```
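
One way to read these numbers is as throughput rather than latency. The following is a minimal sketch, assuming each of the N mysqlslap clients runs the query once per iteration, so that the reported time covers N queries (an assumption about the run, not something stated in this report). `Sample`, `sql3_results`, `throughput_qps`, and `slowdown` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <vector>

// One line of the result tables above: client count and reported seconds.
struct Sample {
    int concurrency;
    double seconds;
};

// Reported results for SQL3, copied from the table above.
inline std::vector<Sample> sql3_results() {
    return {{1, 0.864}, {2, 1.065}, {4, 1.500},
            {8, 2.423}, {16, 4.307}, {32, 8.468}};
}

// Queries per second, ASSUMING the reported time covers `concurrency` queries.
inline double throughput_qps(const Sample& s) {
    assert(s.seconds > 0.0);
    return s.concurrency / s.seconds;
}

// How many times slower a sample is than the single-client baseline.
inline double slowdown(const Sample& base, const Sample& s) {
    assert(base.seconds > 0.0);
    return s.seconds / base.seconds;
}
```

Under that assumption, SQL3 throughput rises from roughly 1.2 queries per second at 1 client to roughly 3.8 at 32 clients, and it is nearly flat between 16 and 32 clients, while latency grows by a factor of about 9.8.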
   
   **Expected behavior**
   Queries are expected to slow down as concurrency increases, which is reasonable.
   But latency degrading this quickly is not reasonable, especially since
   the system resources (CPU, memory, network, I/O) are not saturated.
   
   
   
   After logging and profiling with perf, I think there are three bottlenecks in Doris:
   
   
   
   ***Bottleneck 1: TCMalloc***
   
   ***Bottleneck 2: memory_copy***
   
   ***Bottleneck 3: process_probe_batch and process_build_batch for Join SQL***
   
   Some perf results:
   
   
   SQL1:
   
   perf record -e cycle_activity.stalls_total  -ags -p `pidof palo_be`
   
   ```
      9.71%  palo_be  palo_be             [.] palo::memory_copy
      9.12%  palo_be  palo_be             [.] palo::OlapScanner::get_batch
      7.88%  palo_be  palo_be             [.] palo::Reader::_unique_key_next_row
      7.81%  palo_be  palo_be             [.] palo::RowCursor::full_key_cmp
      5.73%  palo_be  palo_be             [.] palo::CollectIterator::next
      4.86%  palo_be  palo_be             [.] palo::VectorizedRowBatch::dump_to_row_block
      4.19%  palo_be  palo_be             [.] std::__push_heap<__gnu_cxx::__normal_iterator<palo::CollectIterator::ChildCtx**, std::v
      3.26%  palo_be  palo_be             [.] palo::EqIntValPred::get_boolean_val
      2.96%  palo_be  palo_be             [.] SpinLock::SpinLoop
      2.50%  palo_be  palo_be             [.] palo::OlapScanner::_convert_row_to_tuple
      2.36%  palo_be  [kernel.kallsyms]   [k] flush_tlb_func
      1.92%  palo_be  palo_be             [.] palo::SlotRef::get_int_val
      1.59%  palo_be  palo_be             [.] std::__adjust_heap<__gnu_cxx::__normal_iterator<palo::CollectIterator::ChildCtx**, std:
      1.36%  palo_be  [kernel.kallsyms]   [k] clear_page_c_e
      1.26%  palo_be  [kernel.kallsyms]   [k] _raw_spin_lock
   ```
   
   the annotate for palo::memory_copy
   
   ```
     1.07 │        pop    %rbp
          │      template<> inline void fixed_size_memory_copy<2>(void* dst, const void* src) {
          │          *(reinterpret_cast<uint16_t*>(dst)) = * (reinterpret_cast<const uint16_t*>(src));
          │      }
          │
          │      template<> inline void fixed_size_memory_copy<4>(void* dst, const void* src) {
          │          *(reinterpret_cast<uint32_t*>(dst)) = * (reinterpret_cast<const uint32_t*>(src));
    32.73 │        mov    %eax,(%rdi)
          │              return fixed_size_memory_copy<255>(dst, src);
          │          }
          │
          │          memcpy(dst, src, size);
          │          return;
          │      }
    36.66 │      ← retq
          │      inline void fixed_size_memory_copy(void* dst, const void* src) {
          │          struct X {
   ```
   
   the annotate for VectorizedRowBatch::dump_to_row_block
   
   ```
     4.61 │1e0:   lea    0x1(%rbx),%rdi
     4.98 │       mov    %r15,%rsi
          │
          │                 TupleRow* row = row_batch->get_row(row_index++);
          │                 row->set_tuple(0, tuple);
          │                 tuple = reinterpret_cast<Tuple*>(reinterpret_cast<uint8_t*>(tuple) +
          │                                                  tuple_desc.byte_size());
          │             }
     3.56 │       movb   $0x0,(%rbx)
          │         } else {
    33.67 │       mov    %r12,%rdx
     4.93 │       mov    %r8d,-0x38(%rbp)
          │             for (int i = _row_iter; i < _row_iter + size; ++i) {
          │                 for (int j = 0; j < slots.size(); ++j) {
    12.10 │       add    %r12,%r15
          │                 TupleRow* row = row_batch->get_row(row_index++);
          │                 row->set_tuple(0, tuple);
          │                 tuple = reinterpret_cast<Tuple*>(reinterpret_cast<uint8_t*>(tuple) +
          │                                                  tuple_desc.byte_size());
          │             }
          │         } else {
   ```
   
   SQL2:
   
   
   ```
     10.20%  palo_be  palo_be             [.] SpinLock::SpinLoop
      4.67%  palo_be  [kernel.kallsyms]   [k] clear_page_c_e
      4.04%  palo_be  palo_be             [.] palo::memory_copy
      3.93%  palo_be  [kernel.kallsyms]   [k] _raw_spin_lock
      3.27%  palo_be  palo_be             [.] palo::VectorizedRowBatch::dump_to_row_block
      3.10%  palo_be  [kernel.kallsyms]   [k] flush_tlb_func
      2.73%  palo_be  palo_be             [.] tcmalloc::CentralFreeList::FetchFromOneSpans
      2.52%  palo_be  palo_be             [.] tc_deletearray_nothrow
      1.96%  palo_be  palo_be             [.] tcmalloc::CentralFreeList::ReleaseToSpans
      1.78%  palo_be  palo_be             [.] operator new[]
      1.70%  palo_be  palo_be             [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   ```
   
   
   ```
   -   11.71%    10.21%  palo_be  palo_be             [.] SpinLock::SpinLoop
      - 9.59% thread_proxy
         - 9.20% palo::PriorityThreadPool::work_thread
            - 9.19% palo::OlapScanNode::scanner_thread
               - 4.60% palo::OlapScanner::close
                    palo::Reader::~Reader
                  - palo::Reader::close
                     - 4.58% palo::OLAPTable::release_data_sources
                        - 4.58% palo::column_file::ColumnData::~ColumnData
                           - palo::column_file::ColumnData::~ColumnData
                              - 4.33% palo::column_file::SegmentReader::~SegmentReader
                                 - 3.82% boost::detail::sp_counted_base::release
                                    - palo::column_file::ByteBuffer::BufDeleter::operator()
                                       + 2.48% tcmalloc::ThreadCache::Scavenge
                                       + 1.33% tcmalloc::ThreadCache::IncreaseCacheLimit
               - 3.73% palo::OlapScanner::open
                  - palo::Reader::init
                     - 3.41% palo::Reader::_attach_data_to_merge_set
                        - 3.38% palo::column_file::ColumnData::prepare_block_read
                           - palo::column_file::ColumnData::_seek_to_row
                              - 3.06% palo::column_file::ColumnData::_seek_to_block
                                 - 2.87% palo::column_file::SegmentReader::seek_to_block
                                    - 2.50% palo::column_file::SegmentReader::_read_all_data_streams
                                       - 2.45% palo::column_file::ByteBuffer::create
                                            tcmalloc::ThreadCache::FetchFromCentralCache
                                          + tcmalloc::CentralFreeList::RemoveRange
                 0.69% palo::OlapScanner::get_batch
   ```
   
   SQL3:
   
   ```
     13.89%  palo_be  palo_be             [.] palo::HashJoinNode::process_probe_batch
      7.11%  palo_be  palo_be             [.] palo::HashJoinNode::process_build_batch
      3.89%  palo_be  palo_be             [.] palo::memory_copy
      3.60%  palo_be  palo_be             [.] palo::ExprContext::get_value
      3.31%  palo_be  palo_be             [.] SpinLock::SpinLoop
      3.15%  palo_be  palo_be             [.] palo::HashTable::resize_buckets
      2.48%  palo_be  palo_be             [.] tcmalloc::CentralFreeList::FetchFromOneSpans
      2.25%  palo_be  libc-2.17.so        [.] __memcpy_ssse3_back
      2.05%  palo_be  [kernel.kallsyms]   [k] flush_tlb_func
      1.91%  palo_be  palo_be             [.] palo::RowCursor::full_key_cmp
      1.68%  palo_be  [kernel.kallsyms]   [k] clear_page_c_e
      1.42%  palo_be  palo_be             [.] palo::VectorizedRowBatch::dump_to_row_block
      1.32%  palo_be  palo_be             [.] tcmalloc::CentralFreeList::ReleaseToSpans
      1.25%  palo_be  palo_be             [.] palo::DataStreamSender::send
      1.15%  palo_be  palo_be             [.] tcmalloc::ThreadCache::ReleaseToCentralCache
   ```
   
   
   **Discuss**
   
   For the TCMalloc bottleneck:
   
   1 We could reduce the number of calls to ByteBuffer::create
   
   2 Some Impala work may be worth referring to:
   
       IMPALA-5481 RowDescriptors should be shared, rather than copied
   
       IMPALA-5518 Allocate KrpcDataStreamRecvr RowBatch tuples from BufferPool
   
       IMPALA-6425 Change Mempool memory allocation size to be <1MB to avoid allocating from CentralFreeList
   
       IMPALA-4923 Operators running on top of selective Parquet scans spend a lot of time calling impala::MemPool::FreeAll on empty batches
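
To make the first suggestion concrete, here is a minimal sketch of buffer reuse, assuming the buffers have a fixed capacity and that each scanner thread owns its own pool, so no lock is needed. `BufferPool` is a hypothetical class written for illustration; it is not the Doris `ByteBuffer` API and not Impala's BufferPool.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical per-thread free list of fixed-capacity buffers (not Doris code).
// It is NOT thread-safe: the sketch assumes one pool per scanner thread.
class BufferPool {
public:
    explicit BufferPool(std::size_t capacity) : _capacity(capacity) {}

    // Hand out a cached buffer if there is one; call the allocator only on a miss.
    std::unique_ptr<char[]> acquire() {
        if (!_free.empty()) {
            std::unique_ptr<char[]> buf = std::move(_free.back());
            _free.pop_back();
            return buf;
        }
        ++_allocations;
        return std::unique_ptr<char[]>(new char[_capacity]);
    }

    // Keep the buffer for reuse instead of returning it to the allocator.
    void release(std::unique_ptr<char[]> buf) {
        assert(buf != nullptr);
        _free.push_back(std::move(buf));
    }

    std::size_t capacity() const { return _capacity; }

    // Number of times the underlying allocator was actually called.
    std::size_t allocations() const { return _allocations; }

private:
    std::size_t _capacity;
    std::size_t _allocations = 0;
    std::vector<std::unique_ptr<char[]>> _free;
};
```

With a pool like this, an acquire/release/acquire sequence reaches the allocator once instead of twice. Whether that helps in practice depends on how buffer sizes vary between segments, which this sketch does not model.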
   
   
   For the memory_copy bottleneck:
   
   I think the current memcpy implementation is efficient; we should reduce the number of calls to palo::memory_copy.
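
As a rough illustration of reducing the call count, the sketch below contrasts one copy call per value with one copy call per contiguous column. It assumes both source and destination are contiguous arrays of the same fixed-width type, which may not hold for the row-oriented tuples written by `dump_to_row_block`. `copy_per_value` and `copy_batched` are hypothetical names, not functions from the Doris code base.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One copy call per value, similar in spirit to a per-field copy loop.
// Returns the number of copy calls made.
inline std::size_t copy_per_value(const std::vector<int32_t>& src,
                                  std::vector<int32_t>& dst) {
    dst.resize(src.size());
    std::size_t calls = 0;
    for (std::size_t i = 0; i < src.size(); ++i) {
        std::memcpy(&dst[i], &src[i], sizeof(int32_t));
        ++calls;
    }
    return calls;
}

// One copy call for the whole contiguous column.
// Returns the number of copy calls made.
inline std::size_t copy_batched(const std::vector<int32_t>& src,
                                std::vector<int32_t>& dst) {
    dst.resize(src.size());
    if (src.empty()) {
        return 0;
    }
    std::memcpy(dst.data(), src.data(), src.size() * sizeof(int32_t));
    return 1;
}
```

Both functions produce the same destination contents; only the number of calls differs.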
   
   
   **Question**
   Is TCMalloc the best malloc for Doris? Do we need to try jemalloc?
   
   
   If I am wrong, please correct me. Any comments are welcome!
