Rich-T-kid commented on PR #21765: URL: https://github.com/apache/datafusion/pull/21765#issuecomment-4673419896
I've experimented with switching between `GroupValuesRows` and `GroupValuesDictionary` at runtime depending on the size of the dictionary values array, but this isn't a good approach. DataFusion is a streaming engine, so the first batch of data we see isn't necessarily an indicator of future batches. If we decide to use `GroupValuesRows` due to high cardinality in the first incoming RecordBatch and each subsequent batch turns out to be low cardinality, we lose out on the benefits `GroupValuesDictionary` provides (a possible 5x speedup). The inverse is also true. Switching between the two approaches mid-aggregation wouldn't be effective either. Having to `Emit(EmitTo::All)` from one approach and then `intern()` into another would cause more overhead than any savings it could produce. I think the best case to trust users to use the correct data types to represent their data, with `GroupValuesDictionary` it give them another option besides `GroupValuesRows` in case they do have low-medium cardinality. I'm open to hearing other possible approaches @alamb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
