comphead commented on issue #5325: URL: https://github.com/apache/arrow-datafusion/issues/5325#issuecomment-1440566448
That was not as trivial as I expected, so I ran some experiments:

- Update the size incrementally for the upcoming batch only -> this does not seem to be a solution, since we do not know in advance which hashes have already been counted and which have not. The cost of the calculation is higher than the benefit.
- Increasing the batch size from `8k` to `65k` improves the query speed ~6x (the size function depends on the batch and gets called less often).
- Removing the state constituents from the size function improves it by ~20%.
- Approximate size (`first scalar value size * len()`) improves it up to 10x, but the size is not accurate for variable-length types, like strings (see the sketch below).

@alamb @Dandandan let me know your thoughts
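For reference, here is a minimal sketch of what I mean by the approximate sizing, using plain `String` values and a hypothetical `scalar_size` helper rather than the actual accumulator code, just to show the exact-vs-approximate trade-off:

```rust
use std::mem::size_of;

/// Hypothetical helper: inline struct size plus heap capacity of one value.
fn scalar_size(v: &String) -> usize {
    size_of::<String>() + v.capacity()
}

/// Exact accounting: walk every stored value (O(n), paid on every size() call).
fn exact_size(values: &[String]) -> usize {
    values.iter().map(scalar_size).sum()
}

/// Approximate accounting: assume every value is roughly the size of the first
/// one (O(1)); cheap, but inaccurate for variable-length types like strings.
fn approx_size(values: &[String]) -> usize {
    values
        .first()
        .map(|v| scalar_size(v) * values.len())
        .unwrap_or(0)
}

fn main() {
    let values = vec!["a".to_string(), "a much longer string value".to_string()];
    println!("exact:  {} bytes", exact_size(&values));
    println!("approx: {} bytes", approx_size(&values));
}
```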
