alamb opened a new issue, #9403:
URL: https://github.com/apache/arrow-datafusion/issues/9403
### Is your feature request related to a problem or challenge?
As always, I would like faster aggregation performance.
ClickBench Q17 and Q18 include:
```sql
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID",
"SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID",
"SearchPhrase" LIMIT 10;
SELECT "UserID", extract(minute FROM to_timestamp_seconds("EventTime")) AS
m, "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", m, "SearchPhrase"
ORDER BY COUNT(*) DESC LIMIT 10;
```
The grouping columns here are an `Int64` (`UserID`) and a string (`SearchPhrase`, `Utf8`):
```sql
DataFusion CLI v36.0.0
❯ describe 'hits.parquet';
+-----------------------+-----------+-------------+
| column_name | data_type | is_nullable |
+-----------------------+-----------+-------------+
...
| UserID | Int64 | NO |
...
| SearchPhrase | Utf8 | NO |
...
+-----------------------+-----------+-------------+
105 rows in set. Query took 0.035 seconds.
```
While profiling Q19, `SELECT "UserID", "SearchPhrase", COUNT(*) FROM
hits GROUP BY "UserID", "SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;`, I
found that 20-30% of the time is spent converting from Array --> Row or from
Row --> Array.
Thus I think adding some special handling for variable-length data vs.
fixed-length data in the group management may help.
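To make the cost concrete, here is a rough, std-only sketch of grouping an `Int64` column and a string column through a row encoding. It is an illustration under stated assumptions, not DataFusion's implementation: `encode_row`, `decode_row`, and `group_count` are hypothetical names, and the byte layout is made up for this example rather than being the real arrow-row format.

```rust
use std::collections::HashMap;

/// "Array --> Row": pack one (i64, &str) group key into a single byte row.
/// Hypothetical layout: 8 big-endian bytes for the integer, then the UTF-8
/// bytes of the string. Not the actual arrow-row encoding.
fn encode_row(user_id: i64, phrase: &str) -> Vec<u8> {
    let mut row = Vec::with_capacity(8 + phrase.len());
    row.extend_from_slice(&user_id.to_be_bytes());
    row.extend_from_slice(phrase.as_bytes());
    row
}

/// "Row --> Array": unpack a byte row back into its column values.
fn decode_row(row: &[u8]) -> (i64, String) {
    let mut id = [0u8; 8];
    id.copy_from_slice(&row[..8]);
    let phrase = String::from_utf8(row[8..].to_vec()).expect("row holds valid UTF-8");
    (i64::from_be_bytes(id), phrase)
}

/// COUNT(*) grouped by (user_id, phrase) over two parallel input columns.
/// Returns the groups sorted by key so the output order is deterministic.
fn group_count(user_ids: &[i64], phrases: &[&str]) -> Vec<(i64, String, u64)> {
    assert_eq!(user_ids.len(), phrases.len());
    let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();

    // Every input value, including the variable-length string, is copied
    // into a freshly allocated row before it can be hashed and compared.
    for (&id, &phrase) in user_ids.iter().zip(phrases.iter()) {
        *counts.entry(encode_row(id, phrase)).or_insert(0) += 1;
    }

    // Every distinct group is copied again to rebuild the output columns.
    let mut out: Vec<(i64, String, u64)> = counts
        .iter()
        .map(|(row, &n)| {
            let (id, phrase) = decode_row(row);
            (id, phrase, n)
        })
        .collect();
    out.sort();
    out
}
```

Calling `group_count(&[1, 2, 1, 1], &["rust", "arrow", "rust", ""])` copies each string once on the way in and once per distinct group on the way out. Special handling could, for example, keep the fixed-length `Int64` key in its native form and only treat the variable-length column separately, though whether that pays off would need to be measured.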
### Describe the solution you'd like
_No response_
### Describe alternatives you've considered
_No response_
### Additional context
_No response_