alamb edited a comment on issue #1708:
URL: 
https://github.com/apache/arrow-datafusion/issues/1708#issuecomment-1029014217


   > However, as the group-by key cardinality grows, the bottleneck of hash 
aggregation or hash join row concatenation becomes more memory access pattern 
related. 
   
   One thing we might consider is not storing the group key values directly in 
the hash table, but separately. Something like:
   
   ```text
    ┌─────────────┐               ┌─────────┬─────────┐   
    │             │               │Group Key│Group Key│   
    │  HashTable  │               │  Col 1  │  Col 2  │   
    │             │               ├─────────┼─────────┤   
    ├──────┬──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    │      │      │               ├ ─ ─ ─ ─ ┴ ─ ─ ─ ─ ┤   
    │      │      │       ┌───────▶        idx        │   
    │ key  │value │       │       ├ ─ ─ ─ ─ ┬ ─ ─ ─ ─ ┤   
    │      ├──────┤       │       │         │         │   
    │      │ idx  │───────┘       │         │         │   
    │      ├──────┤               │         │         │   
    │      │      │               │         │         │   
    │      │      │               │         │         │   
    └──────┴──────┘               │         │         │   
                                  │         │         │   
   HashTable holds indexes        │         │         │   
   to mutable arary               │         │         │   
                                  │   ...   │   ...   │   
                                  └─────────┴─────────┘   
                                                          
                                                          
                                Mutable (Appendable) Array
                                New group keys are        
                                appended at the end       
   ```
   
   The current hash aggregate code takes this approach -- but instead of using 
Mutable/Appendable arrays it uses a Vec of `GroupState`: 
https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/physical_plan/hash_aggregate.rs#L615
   
   > I'm not proposing to change all these data structures at once to row-wised 
but to point out existing theory and practice on using the row-wise 
organization for these pipeline breakers' states. We should always implement 
and benchmark before deciding on these performance-critical codes, as we did in 
the past.
   
   I agree with this 💯  and among other reasons is why I enjoy working with you 
(and the rest of the people on this chain!)
   
   BTW the DuckDB sorting blog 
https://duckdb.org/2021/08/27/external-sorting.html (and the paper it 
references by Goetz Grafe) have a good treatment of sorting (specifically the 
calculation of sort keys and then a secondary memory shuffle to sort move the 
original data around correctly)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to