comphead commented on issue #5325: URL: https://github.com/apache/arrow-datafusion/issues/5325#issuecomment-1440566448
That was not as trivial as I expected, so I ran some experiments:

- Update the size incrementally for the upcoming batch only -> this does not seem to be a solution, since we do not know in advance which hashes have already been counted and which have not. The cost of the calculation is higher than the benefit.
- Increasing the batch size from `8k` to `65k` improves the query speed ~6x (the size function depends on the batch and gets called less often).
- Removing the state constituents from the size function improves it by ~20%.
- Approximate size (`first scalar value size * len()`) improves it up to 10x, but the size is not accurate for variable-length types, like strings (see the sketch below).

@alamb @Dandandan let me know your thoughts
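For reference, here is a minimal sketch of what I mean by the approximate sizing, using plain `String` values and a hypothetical `scalar_size` helper rather than the actual accumulator code, just to show the exact-vs-approximate trade-off:

```rust
use std::mem::size_of;

/// Hypothetical helper: inline struct size plus heap capacity of one value.
fn scalar_size(v: &String) -> usize {
    size_of::<String>() + v.capacity()
}

/// Exact accounting: walk every stored value (O(n), paid on every size() call).
fn exact_size(values: &[String]) -> usize {
    values.iter().map(scalar_size).sum()
}

/// Approximate accounting: assume every value is roughly the size of the first
/// one (O(1)); cheap, but inaccurate for variable-length types like strings.
fn approx_size(values: &[String]) -> usize {
    values
        .first()
        .map(|v| scalar_size(v) * values.len())
        .unwrap_or(0)
}

fn main() {
    let values = vec!["a".to_string(), "a much longer string value".to_string()];
    println!("exact:  {} bytes", exact_size(&values));
    println!("approx: {} bytes", approx_size(&values));
}
```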
