alamb commented on issue #8860: URL: https://github.com/apache/arrow-datafusion/issues/8860#issuecomment-1891045590
> ah, okay. I was hoping some queries were similar enough to be accelerated
> by that code, given a little effort, but it does not appear that way.
Indeed -- from my perspective the ClickBench queries cannot be changed (we
don't define them; they come from the ClickBench benchmark itself). That is
why I propose adding an extended set of queries here.
What I would like to do is come up with some queries over the same data that
do have the pattern that is accelerated. What about something like:
"Find the top 10 most recently active users"
```sql
WITH hits AS (
  SELECT "UserID",
         "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') AS time
  FROM 'hits.parquet'
)
SELECT min(time), "UserID"
FROM hits
GROUP BY "UserID"
ORDER BY min(time)
LIMIT 10;
```
Though this query appears not to use the optimization 🤔
```sql
❯ explain WITH hits AS (SELECT "UserID", "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') AS time FROM 'hits.parquet')
SELECT min(time), "UserID" FROM hits GROUP BY "UserID" ORDER BY min(time) LIMIT 10;
logical_plan
  Limit: skip=0, fetch=10
    Sort: MIN(hits.time) ASC NULLS LAST, fetch=10
      Projection: MIN(hits.time), hits.UserID
        Aggregate: groupBy=[[hits.UserID]], aggr=[[MIN(hits.time)]]
          SubqueryAlias: hits
            Projection: hits.parquet.UserID, CAST(CAST(CAST(hits.parquet.EventDate AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(hits.parquet.EventTime AS Interval(DayTime)) AS time
              TableScan: hits.parquet projection=[EventTime, EventDate, UserID]
physical_plan
  GlobalLimitExec: skip=0, fetch=10
    SortPreservingMergeExec: [MIN(hits.time)@0 ASC NULLS LAST], fetch=10
      SortExec: TopK(fetch=10), expr=[MIN(hits.time)@0 ASC NULLS LAST]
        ProjectionExec: expr=[MIN(hits.time)@1 as MIN(hits.time), UserID@0 as UserID]
          AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)]
            CoalesceBatchesExec: target_batch_size=8192
              RepartitionExec: partitioning=Hash([UserID@0], 16), input_partitions=16
                AggregateExec: mode=Partial, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)]
                  ProjectionExec: expr=[UserID@2 as UserID, CAST(CAST(CAST(EventDate@1 AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(EventTime@0 AS Interval(DayTime)) as time]
                    ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/hits.parquet:3694994112..4618742640], ...]}, projection=[EventTime, EventDate, UserID]
2 rows in set. Query took 0.048 seconds.
```
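One way to narrow this down might be to drop the CTE and the cast expression entirely and check whether a bare min-per-group / order-by-min / limit query picks up the optimization. This is just a sketch (the `min_time` alias and the simplification to raw `EventTime` are mine, not part of the benchmark):

```sql
-- Same top-10-by-min shape, but grouping and aggregating directly on the
-- raw EventTime column with no casts in the projection. If this plan does
-- use the optimization, the cast expression is the likely culprit.
SELECT min("EventTime") AS min_time, "UserID"
FROM 'hits.parquet'
GROUP BY "UserID"
ORDER BY min_time
LIMIT 10;
```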
