[I] Improve performance for grouping by variable length columns (strings) [arrow-datafusion]

via GitHub Thu, 29 Feb 2024 08:24:11 -0800


alamb opened a new issue, #9403:
URL: https://github.com/apache/arrow-datafusion/issues/9403


   ### Is your feature request related to a problem or challenge?
   
   As always, I would like faster aggregation performance.
   
   ClickBench Q17 and Q18 include:
   
   ```sql
   SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
   SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" LIMIT 10;
   SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
   ```
   
   The grouping columns are an `Int64` (`UserID`) and a string (`Utf8`, `SearchPhrase`):
   ```sql
   DataFusion CLI v36.0.0
   ❯ describe 'hits.parquet';
   +-----------------------+-----------+-------------+
   | column_name           | data_type | is_nullable |
   +-----------------------+-----------+-------------+
   ...
   | UserID                | Int64     | NO          |
   ...
   | SearchPhrase          | Utf8      | NO          |
   ...
   +-----------------------+-----------+-------------+
   105 rows in set. Query took 0.035 seconds.
   ```
   
   In some profiling of Q19, `SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;`, I found that 20-30% of the time is spent converting Array --> Row or Row --> Array.
   
   Thus I think adding some special handling for variable-length data vs. fixed-length data in the group management may help.
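
To make the idea concrete, here is a minimal, stdlib-only sketch of the contrast. It is not DataFusion's or arrow-rs's actual code: `ColumnarGroups`, `count_with_row_keys`, and the byte encoding of the row key are all hypothetical names and choices made for illustration. The baseline copies every input row into an owned byte "row" before hashing (the Array --> Row step), while the columnar variant keeps the `Int64` keys in a `Vec<i64>` and the string keys in one contiguous buffer with offsets, so emitting results needs no Row --> Array step.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Baseline for contrast (hypothetical): copy each input row into an owned
/// byte key, i.e. an "Array --> Row" conversion for every input row.
fn count_with_row_keys(user_ids: &[i64], phrases: &[&str]) -> HashMap<Vec<u8>, u64> {
    let mut counts = HashMap::new();
    for (id, phrase) in user_ids.iter().zip(phrases.iter()) {
        let mut row = Vec::with_capacity(8 + phrase.len());
        row.extend_from_slice(&id.to_be_bytes()); // fixed-width part
        row.extend_from_slice(phrase.as_bytes()); // variable-length part
        *counts.entry(row).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical group store that keeps the group keys in columnar form.
#[derive(Default)]
struct ColumnarGroups {
    user_ids: Vec<i64>,
    phrase_bytes: Vec<u8>,
    /// Phrase `g` is `phrase_bytes[phrase_offsets[g]..phrase_offsets[g + 1]]`.
    phrase_offsets: Vec<usize>,
    counts: Vec<u64>,
    /// Hash of (user_id, phrase) -> indices of the groups with that hash.
    index: HashMap<u64, Vec<usize>>,
}

impl ColumnarGroups {
    fn new() -> Self {
        Self { phrase_offsets: vec![0], ..Default::default() }
    }

    /// Borrow the string key of group `group` from the shared buffer.
    fn phrase(&self, group: usize) -> &str {
        let range = self.phrase_offsets[group]..self.phrase_offsets[group + 1];
        std::str::from_utf8(&self.phrase_bytes[range]).expect("only valid UTF-8 is appended")
    }

    fn hash_key(user_id: i64, phrase: &str) -> u64 {
        let mut hasher = DefaultHasher::new();
        user_id.hash(&mut hasher);
        phrase.hash(&mut hasher);
        hasher.finish()
    }

    /// Count one input row, comparing it directly against the columnar
    /// storage; bytes are copied only when a new group is created.
    fn add(&mut self, user_id: i64, phrase: &str) {
        let hash = Self::hash_key(user_id, phrase);
        let existing = self.index.get(&hash).and_then(|candidates| {
            candidates
                .iter()
                .copied()
                .find(|&g| self.user_ids[g] == user_id && self.phrase(g) == phrase)
        });
        match existing {
            Some(g) => self.counts[g] += 1,
            None => {
                let g = self.user_ids.len();
                self.user_ids.push(user_id);
                self.phrase_bytes.extend_from_slice(phrase.as_bytes());
                self.phrase_offsets.push(self.phrase_bytes.len());
                self.counts.push(1);
                self.index.entry(hash).or_default().push(g);
            }
        }
    }
}

fn main() {
    let user_ids = [1_i64, 2, 1, 1];
    let phrases = ["rust", "arrow", "rust", ""];

    let mut groups = ColumnarGroups::new();
    for (id, phrase) in user_ids.iter().zip(phrases.iter()) {
        groups.add(*id, phrase);
    }
    // The group keys are already stored column-wise, so output is a direct read.
    for g in 0..groups.counts.len() {
        println!("{} {:?} {}", groups.user_ids[g], groups.phrase(g), groups.counts[g]);
    }
    println!("row-key groups: {}", count_with_row_keys(&user_ids, &phrases).len());
}
```

Whether a layout like this actually recovers the 20-30% seen in the profile would need to be measured; the sketch only shows where the per-row copies can be avoided.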
    
   
   
   ### Describe the solution you'd like
   
   _No response_
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


