jhorstmann commented on issue #5547:
URL:
https://github.com/apache/arrow-datafusion/issues/5547#issuecomment-1463830701
> Rough outline how deduplication could work using existing arrow kernels:
The sorting would probably make this more expensive than the existing
accumulators. I missed that the question was in a DataFusion context.
I think what is hindering the performance of the `DistinctCountAccumulator`
is the indirection through `Vec<ScalarValue>`, where each `ScalarValue` in turn
contains a string. The many enum variants of `ScalarValue` probably also lead
to a complicated `eq` method.
One way to avoid these indirections would be to use the row format
internally in the accumulator. The state of the accumulator would consist of a
byte vector that contains serialized keys, and a `hashbrown::RawTable` that
only contains indices into that vector. The hash and eq calculations would then
also work only on these serialized bytes, which should make them cheaper. Storing
all keys in one vector should also improve cache locality.
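
To make the idea above a bit more concrete, here is a rough, std-only sketch of such a state. It is not the DataFusion implementation: the type `DistinctBytes`, the helper `hash_bytes`, and the hand-rolled linear-probing table are all hypothetical stand-ins (the table plays the role that `hashbrown::RawTable` would play), and the keys are assumed to be already serialized into byte slices by the caller.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Marker for an unused slot in the index table.
const EMPTY: u32 = u32::MAX;

/// Hypothetical accumulator state: all distinct keys live back to back in
/// one buffer, and the hash table stores only indices into it.
struct DistinctBytes {
    data: Vec<u8>,       // serialized keys, concatenated
    offsets: Vec<usize>, // key i occupies data[offsets[i]..offsets[i + 1]]
    slots: Vec<u32>,     // open-addressing table of key indices (power-of-two length)
}

/// Hash a serialized key; any byte-oriented hasher would do here.
fn hash_bytes(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

impl DistinctBytes {
    fn new() -> Self {
        Self { data: Vec::new(), offsets: vec![0], slots: vec![EMPTY; 16] }
    }

    /// Number of distinct keys seen so far.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    /// Byte slice of the key with the given index.
    fn key(&self, idx: u32) -> &[u8] {
        let i = idx as usize;
        &self.data[self.offsets[i]..self.offsets[i + 1]]
    }

    /// Insert a serialized key; returns true if it was not present before.
    fn insert(&mut self, key: &[u8]) -> bool {
        // Keep the load factor at or below 1/2 so probing always terminates.
        if (self.len() + 1) * 2 > self.slots.len() {
            self.grow();
        }
        let mask = self.slots.len() - 1;
        let mut pos = hash_bytes(key) as usize & mask;
        loop {
            let idx = self.slots[pos];
            if idx == EMPTY {
                break; // free slot found, key is new
            }
            if self.key(idx) == key {
                return false; // eq is a plain byte-slice comparison
            }
            pos = (pos + 1) & mask; // linear probing
        }
        let new_idx = self.len() as u32;
        self.data.extend_from_slice(key);
        self.offsets.push(self.data.len());
        self.slots[pos] = new_idx;
        true
    }

    /// Double the table and re-insert all indices by rehashing the stored bytes.
    fn grow(&mut self) {
        let new_len = self.slots.len() * 2;
        let mask = new_len - 1;
        let mut slots = vec![EMPTY; new_len];
        for idx in 0..self.len() as u32 {
            let mut pos = hash_bytes(self.key(idx)) as usize & mask;
            while slots[pos] != EMPTY {
                pos = (pos + 1) & mask;
            }
            slots[pos] = idx;
        }
        self.slots = slots;
    }
}
```

Calling `insert` once per serialized value and reading `len()` at the end gives the distinct count. The sketch rehashes keys when the table grows; a real implementation could store the hash next to each index to avoid that, at the cost of a larger table.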
Keys would be serialized into a byte vector,