2010YOUY01 commented on PR #15324:
URL: https://github.com/apache/datafusion/pull/15324#issuecomment-2739672614

   > Thank you @waynexia, I'm planning to check it out at most tomorrow.
   > 
   > I have a question in advance before reviewing -- have you been considering 
to implement groups accumulator for 
[specialized](https://github.com/apache/datafusion/tree/main/datafusion/functions-aggregate-common/src/aggregate/count_distinct)
 cases of DistinctCountAccumulator (primitive/native types and bytes)?
   > 
   > I'm asking because, as for me, it looks a bit odd (though I haven't 
rechecked performance results, and perhaps GroupsAccumulatorAdapter introduces 
some insane overhead), that switching from native Rust types to ScalarValue 
still gives x5 faster execution, while groups accumulator in this case, if I'm 
not mistaken, does basically the same as the GroupsAccumulatorAdapter -- 
storing separate states (hashsets) in the vector is already 
[implemented](https://github.com/apache/datafusion/blob/97548a2584614acb3211b862e1ebca8349b6a832/datafusion/functions-aggregate-common/src/aggregate/groups_accumulator.rs#L96)
 in the adapter.
   
   Here are my thoughts on why 5X given implementations are very similar:
   `GroupsAccumulator` for `median()` in 
https://github.com/apache/datafusion/pull/13681 also achieved 5X speed-up, we 
found 2X is due to old `Accumulator` implementation has a inefficient 
`vec<ScalarValue>  <---> ListArray` conversion when `state()` and 
`merge_batch()` is called, and the remaining 2X difference I believe is due to 
inefficiency in adaptor, or extra memory allocation for many outer state 
struct. Perhaps `Accumulator` for `DistinctCount` is also not implementing this 
list conversion efficiently 🤔 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to