[GitHub] [arrow-datafusion] jhorstmann commented on issue #5547: Improve the performance of COUNT DISTINCT queries for high cardinality groups

via GitHub Fri, 10 Mar 2023 05:53:22 -0800


jhorstmann commented on issue #5547:
URL: 
https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1463830701


   > Rough outline how deduplication could work using existing arrow kernels:
   
   The sorting would probably make this more expensive than the existing 
accumulators. I missed that the question was in a DataFusion context.
   
   I think what is hindering the performance of the `DistinctCountAccumulator` 
is the double indirection: a `Vec<ScalarValue>` whose `ScalarValue` elements in 
turn contain a heap-allocated string. The many enum variants of `ScalarValue` 
probably also lead to a complicated `eq` method.
   
   One way to avoid these indirections would be to use the row format 
internally in the accumulator. The state of the accumulator would consist of a 
byte vector that contains the serialized keys and a `hashbrown::RawTable` that 
only contains indices into that vector. The hash and `eq` calculations would 
then operate directly on these byte slices, making them more efficient. Storing 
all keys in one vector should also improve cache locality.
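
   A minimal sketch of that state layout, using only the standard library. The 
names (`DistinctKeys`, `find_slot`, `grow`) are hypothetical, keys are assumed 
to be already serialized into bytes by the caller, and a small hand-rolled 
linear-probing table stands in for `hashbrown::RawTable`; it is an illustration 
of the idea, not the DataFusion implementation.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Marker for an unused slot in the index table.
const EMPTY: u32 = u32::MAX;

/// Hypothetical distinct-count state: all keys live in one byte buffer,
/// and the hash table stores only key indices (no per-key allocation).
struct DistinctKeys {
    bytes: Vec<u8>,      // serialized keys, stored back to back
    offsets: Vec<usize>, // key i occupies bytes[offsets[i]..offsets[i + 1]]
    slots: Vec<u32>,     // open-addressing table of key indices
}

/// Hash computed directly on the serialized bytes.
fn hash_bytes(key: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(key);
    h.finish()
}

impl DistinctKeys {
    fn new() -> Self {
        // The table length stays a power of two so that masking works.
        Self { bytes: Vec::new(), offsets: vec![0], slots: vec![EMPTY; 16] }
    }

    /// Number of distinct keys seen so far.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Borrow the serialized bytes of the key with the given index.
    fn key(&self, idx: u32) -> &[u8] {
        let i = idx as usize;
        &self.bytes[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Linear probe: returns the slot holding `key`, or the first empty slot.
    /// Equality is a plain byte-slice comparison.
    fn find_slot(&self, key: &[u8]) -> usize {
        let mask = self.slots.len() - 1;
        let mut pos = hash_bytes(key) as usize & mask;
        loop {
            let idx = self.slots[pos];
            if idx == EMPTY || self.key(idx) == key {
                return pos;
            }
            pos = (pos + 1) & mask;
        }
    }

    /// Inserts a serialized key; returns true if it was not seen before.
    fn insert(&mut self, key: &[u8]) -> bool {
        let pos = self.find_slot(key);
        if self.slots[pos] != EMPTY {
            return false;
        }
        let idx = self.len() as u32;
        self.bytes.extend_from_slice(key);
        self.offsets.push(self.bytes.len());
        self.slots[pos] = idx;
        // Keep the load factor at or below 1/2 so probing always terminates.
        if self.len() * 2 > self.slots.len() {
            self.grow();
        }
        true
    }

    /// Doubles the table and re-inserts every key index.
    fn grow(&mut self) {
        self.slots = vec![EMPTY; self.slots.len() * 2];
        for idx in 0..self.len() as u32 {
            let pos = self.find_slot(self.key(idx));
            self.slots[pos] = idx;
        }
    }
}
```

   After feeding every serialized key through `insert`, `len()` gives the 
distinct count. A real implementation would likely also cache the hash next to 
each index to avoid rehashing on growth, which is something `RawTable` makes 
possible.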
   
    Keys would be serialized into a byte vector, 


