Re: [PR] Optimize Dictionary groupings [datafusion]

via GitHub Wed, 10 Jun 2026 11:52:36 -0700


Rich-T-kid commented on PR #21765:
URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4673419896


   I've experimented with switching between `GroupValuesRows` and 
`GroupValuesDictionary` at runtime depending on the size of the dictionary 
values array, but this isn't a good approach. DataFusion is a streaming engine, 
so the first batch of data we see isn't necessarily an indicator of future 
batches. If we decide to use `GroupValuesRows` due to high cardinality in the 
first incoming RecordBatch and each subsequent batch turns out to be low 
cardinality, we lose out on the benefits `GroupValuesDictionary` provides (a 
possible 5x speedup). The inverse is also true.
   Switching between the two approaches mid-aggregation wouldn't be effective 
either. Having to `Emit(EmitTo::All)` from one approach and then `intern()` 
into another would cause more overhead than any savings it could produce. 
   I think the best case to trust users to use the correct data types to 
represent their data, with `GroupValuesDictionary` it give them another option 
besides `GroupValuesRows` in case they do have low-medium cardinality.
   I'm open to hearing other possible approaches @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Optimize Dictionary groupings [datafusion]

Reply via email to