[GitHub] [arrow-datafusion] alamb commented on issue #5325: Optimize Accumulator `size` function performance (fix regression on clickbench)

via GitHub Thu, 23 Feb 2023 05:19:24 -0800


alamb commented on issue #5325:
URL: 
https://github.com/apache/arrow-datafusion/issues/5325#issuecomment-1441767680


   > approx size(first scalar value size * len()) improves up to 10 times, but 
not accurate size for variable length, like strings
   
   Thank you for looking into this @comphead 
   
   I think we should definitely use this approach for fixed length (non 
variable length) data -- it will solve the performance regression we saw for 
clickbench
   
   In terms of handling variable length data more efficiently, I am not sure it 
is worth a lot of time optimizing the `size()` implementation, because my guess 
is that the `size()` calculation will be a smaller portion of the overall 
runtime for high cardinality string columns (each distinct value will be a 
separately allocated string, for example).
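   
   To make the trade-off concrete, here is a minimal sketch of the two strategies. It is not the DataFusion code: `Value`, `exact_size` and `estimated_size` are hypothetical names, and the shortcut assumes every value in the slice has the same type as the first one.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a scalar value kept by a distinct-count
/// accumulator (not DataFusion's actual type).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Value {
    Int64(i64),
    Utf8(String),
}

impl Value {
    /// Bytes used by this value, including its heap allocation (if any).
    fn size(&self) -> usize {
        match self {
            Value::Int64(_) => size_of::<Value>(),
            Value::Utf8(s) => size_of::<Value>() + s.capacity(),
        }
    }

    /// True when every value of this variant occupies the same number of bytes.
    fn is_fixed_width(&self) -> bool {
        matches!(self, Value::Int64(_))
    }
}

/// Exact size: visits every value, so the cost is O(n) per call.
fn exact_size(values: &[Value]) -> usize {
    values.iter().map(Value::size).sum()
}

/// Estimated size: O(1) for fixed-width data (first value size * len),
/// falling back to the exact walk for variable-length data.
/// Assumption: all values share the variant of the first one.
fn estimated_size(values: &[Value]) -> usize {
    match values.first() {
        Some(first) if first.is_fixed_width() => first.size() * values.len(),
        _ => exact_size(values),
    }
}
```

   For a slice of `Int64` values both functions return the same number, but `estimated_size` does not touch the individual values. For `Utf8` values, multiplying the first value's size by the length would be wrong whenever the strings differ in capacity, which is why the sketch falls back to the exact walk there.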
   
   I think a separate project to handle 
`COUNT(DISTINCT high_cardinality_string_column)` is probably needed
   
   Thus I propose:
   1. Implement the simplest thing that fixes the clickbench performance 
regression
   2. File a follow-on ticket to track further improving the performance of 
`COUNT(DISTINCT ...)` queries (I can do this)


