mingmwang commented on issue #4973:
URL: https://github.com/apache/arrow-datafusion/issues/4973#issuecomment-1610843919

   Some findings from reading the DuckDB source code:
   
   1) DuckDB uses generics to manage the Accumulator state and function pointers to do the dispatch; there are some C++ static casts involved.
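   As a rough Rust analogue of that pattern (a sketch with made-up names, not DuckDB's actual code): the aggregate state is a plain fixed-size struct, and dispatch goes through plain function pointers that cast an opaque state pointer back to the concrete type, so no `dyn Trait` object or per-group allocation is needed.
   
   ```rust
   // Sketch only: fixed-size state + function-pointer dispatch.
   struct AvgState {
       count: u64,
       sum: i128,
   }
   
   // The function pointers operate on an opaque state pointer; the concrete
   // type is recovered with an unsafe cast, mirroring DuckDB's static_cast.
   struct AggregateVTable {
       state_size: usize,
       init: unsafe fn(*mut u8),
       update: unsafe fn(*mut u8, i128),
   }
   
   unsafe fn avg_init(state: *mut u8) {
       let s = &mut *(state as *mut AvgState);
       s.count = 0;
       s.sum = 0;
   }
   
   unsafe fn avg_update(state: *mut u8, value: i128) {
       let s = &mut *(state as *mut AvgState);
       s.count += 1;
       s.sum += value;
   }
   
   fn main() {
       let vt = AggregateVTable {
           state_size: std::mem::size_of::<AvgState>(),
           init: avg_init,
           update: avg_update,
       };
       // Allocate one state with the right alignment and treat it as opaque bytes.
       let mut state = AvgState { count: 0, sum: 0 };
       let ptr = &mut state as *mut AvgState as *mut u8;
       unsafe {
           (vt.init)(ptr);
           (vt.update)(ptr, 42);
       }
       println!("state size: {} bytes, count={}, sum={}", vt.state_size, state.count, state.sum);
   }
   ```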
   2) There is no magic in the HashMap itself. The `PartitionableHashTable` is not used in the single-threaded model. The two-level hash table idea is the same as in ClickHouse: when the total number of groups in the single hash table exceeds a threshold (10000), it switches to the partitioned version.
   
   ```c++
   // two level hash tables
   vector<unique_ptr<PartitionableHashTable>> intermediate_hts;
   vector<shared_ptr<GroupedAggregateHashTable>> finalized_hts;
   
   // radix_limit is hard coded to 10000
   if (llstate.total_groups >= radix_limit) {
        gstate.partitioned = true;
   }
   
   ```
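   A minimal Rust sketch of that switch-over idea (hypothetical names, not DuckDB's code): aggregate into a single hash table until the group count crosses the radix limit, then radix-partition the groups by the high bits of the key's hash so each partition can later be finalized independently.
   
   ```rust
   use std::collections::hash_map::DefaultHasher;
   use std::collections::HashMap;
   use std::hash::{Hash, Hasher};
   
   const RADIX_BITS: u32 = 4;         // 16 partitions, for illustration
   const RADIX_LIMIT: usize = 10_000; // the hard-coded threshold mentioned above
   
   // Hypothetical per-group aggregate state.
   #[derive(Default)]
   struct AggState {
       count: u64,
       sum: i64,
   }
   
   enum GroupingMode {
       Single(HashMap<u64, AggState>),
       Partitioned(Vec<HashMap<u64, AggState>>),
   }
   
   fn partition_of(key: u64) -> usize {
       // Radix-partition by the top bits of the key's hash.
       let mut h = DefaultHasher::new();
       key.hash(&mut h);
       (h.finish() >> (64 - RADIX_BITS)) as usize
   }
   
   fn update(mode: &mut GroupingMode, key: u64, value: i64) {
       // Switch to the partitioned layout once the single table gets too big.
       if let GroupingMode::Single(map) = mode {
           if map.len() >= RADIX_LIMIT {
               let mut parts: Vec<HashMap<u64, AggState>> =
                   (0..(1usize << RADIX_BITS)).map(|_| HashMap::new()).collect();
               for (k, v) in map.drain() {
                   parts[partition_of(k)].insert(k, v);
               }
               *mode = GroupingMode::Partitioned(parts);
           }
       }
       let slot = match mode {
           GroupingMode::Single(map) => map.entry(key).or_default(),
           GroupingMode::Partitioned(parts) => {
               parts[partition_of(key)].entry(key).or_default()
           }
       };
       slot.count += 1;
       slot.sum += value;
   }
   
   fn main() {
       let mut mode = GroupingMode::Single(HashMap::new());
       for key in 0..20_000u64 {
           update(&mut mode, key, 1);
       }
       if let GroupingMode::Partitioned(parts) = &mode {
           println!("switched to {} partitions", parts.len());
       }
   }
   ```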
   3) DuckDB uses a `RowLayout`; the row layout keeps both the group keys and the aggregate states together in a single row.
   
   ```c++
   class RowLayout {
        // ...
   private:
        //! The types of the data columns
        vector<LogicalType> types;
        //! The aggregate functions
        Aggregates aggregates;
        //! The width of the validity header
        idx_t flag_width;
        //! The width of the data portion
        idx_t data_width;
        //! The width of the aggregate state portion
        idx_t aggr_width;
        //! The width of the entire row
        idx_t row_width;
        //! The offsets to the columns and aggregate data in each row
        vector<idx_t> offsets;
        //! Whether all columns in this layout are constant size
        bool all_constant;
        //! Offset to the pointer to the heap for each row
        idx_t heap_pointer_offset;
   };
   ```
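   To make the layout concrete, here is a hedged Rust sketch (my own field names, not DuckDB's code) of how such a layout can be computed: given the group-key column widths and the aggregate state widths, the offsets are precomputed once, and every group becomes one contiguous row of `row_width` bytes.
   
   ```rust
   // Sketch of computing a fixed-width row layout: validity flags, then the
   // group-key columns, then the aggregate states, packed into one row.
   struct RowLayout {
       flag_width: usize,   // validity bitmap bytes, one bit per column
       data_width: usize,   // total width of the group-key columns
       aggr_width: usize,   // total width of the aggregate states
       row_width: usize,    // flag_width + data_width + aggr_width
       offsets: Vec<usize>, // offset of each column and aggregate state
   }
   
   impl RowLayout {
       fn new(column_widths: &[usize], aggr_state_widths: &[usize]) -> Self {
           let flag_width = (column_widths.len() + 7) / 8;
           let mut offsets = Vec::new();
           let mut offset = flag_width;
           for w in column_widths {
               offsets.push(offset);
               offset += w;
           }
           let data_width: usize = column_widths.iter().sum();
           for w in aggr_state_widths {
               offsets.push(offset);
               offset += w;
           }
           let aggr_width: usize = aggr_state_widths.iter().sum();
           RowLayout { flag_width, data_width, aggr_width, row_width: offset, offsets }
       }
   }
   
   fn main() {
       // Roughly the q17 shape below: an INTEGER and a UBIGINT group key plus one
       // avg state (i128 sum + u64 count); DuckDB's real widths differ slightly
       // because of extra bookkeeping such as the heap pointer.
       let layout = RowLayout::new(&[4, 8], &[24]);
       println!("flag={} data={} aggr={} row={} offsets={:?}",
                layout.flag_width, layout.data_width, layout.aggr_width,
                layout.row_width, layout.offsets);
   }
   ```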
   
   4) Due to the `RowLayout` and the use of generics, the memory size of the group state in DuckDB is much smaller than DataFusion's, maybe more than `5 times` smaller than ours. I guess this is the major reason why DuckDB is faster and has a more efficient memory access pattern.
   
   DuckDB test result for tpch-q17:
   
   ```c++
        std::cout << "layout column count:" << layout.ColumnCount() << 
std::endl;
        for (auto it = layout.GetTypes().cbegin(); it != 
layout.GetTypes().cend() ; ++it) std::cout << (*it).ToString() << " ";
        std::cout << "layout agg count:" << layout.AggregateCount() << 
std::endl;
        std::cout << "layout DataWidth:" << layout.GetDataWidth() << std::endl;
        std::cout << "layout RowWidth:" << layout.GetRowWidth() << std::endl;
   ```
   ```
   layout column count:2
   INTEGER UBIGINT 
   layout agg count:1
   layout DataWidth:15
   layout RowWidth:40
   ```
   
   DataFusion test result:
   
   ```rust
       let acc = AvgAccumulator::try_new(
           &DataType::Decimal128(18, 0),
           &DataType::Decimal128(24, 0),
       )?;
   
       println!("AvgAccumulator size: {:?}", 
std::mem::size_of::<AvgAccumulator>());
   
       struct NormalAVGStruct {
           count: u64,
           sum: i128,
       }
   
       println!("NormalAVGStruct size: {:?}", 
std::mem::size_of::<NormalAVGStruct>());
       println!("GroupState size: {:?}", std::mem::size_of::<GroupState>());
       println!("Arrow OwnedRow size: {:?}", std::mem::size_of::<OwnedRow>());
       println!("Accumulator item vec size: {:?}", 
std::mem::size_of::<Vec<AccumulatorItem>>());
       println!("usize vec size: {:?}", std::mem::size_of::<Vec<usize>>());
       println!("boxed u8 size: {:?}", std::mem::size_of::<Box<[u8]>>());
   ```
   
   ```
   AvgAccumulator size: 112
   NormalAVGStruct size: 32
   GroupState size: 112
   Arrow OwnedRow size: 40
   Accumulator item vec size: 24
   usize vec size: 24
   boxed u8 size: 16
   ```
   
   What I learned from this is that I think we should avoid using `Arc`, `Box`, `Vec`, and `dyn Trait` in the group state. Those are all pointers: each pointer consumes at least 8 bytes, and a pointer to some other memory address leads to random access.
   We should also avoid using `ScalarValue` in the Accumulator: `ScalarValue` is an enum type, so its memory size is padded up to the largest variant. That means we have to make the Accumulator generic.
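   For example, a generic accumulator along these lines (just a sketch, not a proposed API) keeps the per-group state at its natural fixed size instead of the padded `ScalarValue` / boxed representation:
   
   ```rust
   // Sketch: a generic average accumulator whose state is a plain struct,
   // monomorphized per native type, with no ScalarValue, Box, Vec or dyn Trait.
   use std::ops::AddAssign;
   
   trait NativeSum: Copy + Default + AddAssign {}
   impl NativeSum for i128 {}
   impl NativeSum for f64 {}
   
   #[derive(Default)]
   struct GenericAvgAccumulator<S: NativeSum> {
       count: u64,
       sum: S,
   }
   
   impl<S: NativeSum> GenericAvgAccumulator<S> {
       fn update(&mut self, value: S) {
           self.count += 1;
           self.sum += value;
       }
   }
   
   fn main() {
       // 16-byte i128 sum + 8-byte count (padded to 32) vs. the 112 bytes above.
       println!("decimal avg state: {} bytes",
                std::mem::size_of::<GenericAvgAccumulator<i128>>());
       println!("f64 avg state: {} bytes",
                std::mem::size_of::<GenericAvgAccumulator<f64>>());
       let mut acc = GenericAvgAccumulator::<i128>::default();
       acc.update(42);
       println!("count={}, sum={}", acc.count, acc.sum);
   }
   ```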
   
   https://duckdb.org/2022/03/07/aggregate-hashtable.html

