Rich-T-kid commented on PR #21589: URL: https://github.com/apache/datafusion/pull/21589#issuecomment-4253764562
### Benchmark Results: Dictionary vs GroupValueRows (single dict column)

Benchmarks were parameterized across cardinality (XSmall/Small/Medium/Large), batch size (Small=8192 / Medium=32768 / Large=65536), and null rate (Zero/Low/Medium/High); see [single_column_aggr.rs] for the full benchmark code.

**XSmall cardinality, Small batch**
- Zero nulls: GroupValueRows 73.1 µs → Dictionary 33.9 µs (~2.2x)
- High nulls: GroupValueRows 94.1 µs → Dictionary 32.5 µs (~2.9x)

**XSmall cardinality, Medium batch**
- Zero nulls: GroupValueRows 280.5 µs → Dictionary 143.6 µs (~2.0x)
- High nulls: GroupValueRows 389.9 µs → Dictionary 131.8 µs (~3.0x)

**Large cardinality, Large batch**
- Zero nulls: GroupValueRows 568.1 µs → Dictionary 217.4 µs (~2.6x)
- High nulls: GroupValueRows 836.6 µs → Dictionary 253.7 µs (~3.3x)

**Large cardinality, Large batch (multi_batch)**
- Zero nulls: GroupValueRows 1.668 ms → Dictionary 878 µs (~1.9x)
- High nulls: GroupValueRows 2.519 ms → Dictionary 741.6 µs (~3.4x)

Dictionary is consistently about 1.9–3.4x faster across the configurations shown above. The speedup grows with the null rate: in every pair, the High-nulls case gains more than the Zero-nulls case. As nulls increase, Dictionary benefits from its internal value caching, avoiding redundant work on null entries.

### Full benchmarks below

[single_dict_column_result.txt](https://github.com/user-attachments/files/26756526/single_dict_column_result.txt)
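To illustrate why caching can pay off more as nulls increase, here is a minimal sketch of per-key group-id caching for a dictionary-encoded column. It is not the code from this PR. The function name `group_ids_with_key_cache`, the plain-slice representation of keys and values, and the single cached null group are all assumptions made for illustration, and it does not use the Arrow or DataFusion APIs.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not from the PR): assigns a group id to every row of a
/// dictionary-encoded column. `keys[i]` is the dictionary index of row `i`
/// (`None` for a null row) and `values` is the dictionary itself.
/// Returns the per-row group ids and the number of distinct groups.
fn group_ids_with_key_cache(keys: &[Option<usize>], values: &[&str]) -> (Vec<usize>, usize) {
    // One cache slot per dictionary entry, filled the first time that key is seen.
    let mut key_to_group: Vec<Option<usize>> = vec![None; values.len()];
    // All null rows share one group id, resolved once and then reused.
    let mut null_group: Option<usize> = None;
    // Two dictionary entries may hold equal values, so groups are deduplicated by value.
    let mut value_to_group: HashMap<&str, usize> = HashMap::new();
    let mut next_group = 0usize;
    let mut out = Vec::with_capacity(keys.len());

    for key in keys {
        let group = match *key {
            None => match null_group {
                // Cache hit: a null row costs one branch, no hashing.
                Some(g) => g,
                None => {
                    let g = next_group;
                    next_group += 1;
                    null_group = Some(g);
                    g
                }
            },
            Some(k) => match key_to_group[k] {
                // Cache hit: the value behind this key is never hashed again.
                Some(g) => g,
                None => {
                    // First sighting of this key: hash the value once.
                    let g = match value_to_group.get(values[k]) {
                        Some(&g) => g,
                        None => {
                            let g = next_group;
                            next_group += 1;
                            value_to_group.insert(values[k], g);
                            g
                        }
                    };
                    key_to_group[k] = Some(g);
                    g
                }
            },
        };
        out.push(group);
    }
    (out, next_group)
}
```

In this sketch the value hash map is consulted at most once per dictionary entry, and null rows never touch it. A row-based approach would instead encode and hash every row, nulls included, which is one plausible reading of why the gap widens at high null rates.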
