alamb commented on issue #8860: URL: https://github.com/apache/arrow-datafusion/issues/8860#issuecomment-1891045590
> ah, okay. I was hoping some queries were similar enough to be accelerated
> by that code, given a little effort, but it does not appear that way.
Indeed -- from my perspective the ClickBench queries cannot be changed (we
don't define them; they come from the ClickBench benchmark itself). That is
why I propose adding an extended set of queries here.
What I would like to do is come up with some queries over the same data that
do have the pattern that is accelerated. What about something like:
"Find the top 10 most recently active users"
```sql
WITH hits AS (
  SELECT "UserID",
         "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') AS time
  FROM 'hits.parquet'
)
SELECT min(time), "UserID"
FROM hits
GROUP BY "UserID"
ORDER BY min(time)
LIMIT 10;
```
Though this query appears not to use the optimization 🤔
```sql
❯ explain WITH hits AS (SELECT "UserID", "EventDate"::INT::DATE::TIMESTAMP + arrow_cast("EventTime", 'Interval(DayTime)') AS time FROM 'hits.parquet')
SELECT min(time), "UserID" FROM hits GROUP BY "UserID" ORDER BY min(time) LIMIT 10;
logical_plan
  Limit: skip=0, fetch=10
    Sort: MIN(hits.time) ASC NULLS LAST, fetch=10
      Projection: MIN(hits.time), hits.UserID
        Aggregate: groupBy=[[hits.UserID]], aggr=[[MIN(hits.time)]]
          SubqueryAlias: hits
            Projection: hits.parquet.UserID, CAST(CAST(CAST(hits.parquet.EventDate AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(hits.parquet.EventTime AS Interval(DayTime)) AS time
              TableScan: hits.parquet projection=[EventTime, EventDate, UserID]
physical_plan
  GlobalLimitExec: skip=0, fetch=10
    SortPreservingMergeExec: [MIN(hits.time)@0 ASC NULLS LAST], fetch=10
      SortExec: TopK(fetch=10), expr=[MIN(hits.time)@0 ASC NULLS LAST]
        ProjectionExec: expr=[MIN(hits.time)@1 as MIN(hits.time), UserID@0 as UserID]
          AggregateExec: mode=FinalPartitioned, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)]
            CoalesceBatchesExec: target_batch_size=8192
              RepartitionExec: partitioning=Hash([UserID@0], 16), input_partitions=16
                AggregateExec: mode=Partial, gby=[UserID@0 as UserID], aggr=[MIN(hits.time)]
                  ProjectionExec: expr=[UserID@2 as UserID, CAST(CAST(CAST(EventDate@1 AS Int32) AS Date32) AS Timestamp(Nanosecond, None)) + CAST(EventTime@0 AS Interval(DayTime)) as time]
                    ParquetExec: file_groups={16 groups: [[Users/andrewlamb/Downloads/hits.parquet:0..923748528], [Users/andrewlamb/Downloads/hits.parquet:923748528..1847497056], [Users/andrewlamb/Downloads/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Downloads/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Downloads/hits.parquet:3694994112..4618742640], ...]}, projection=[EventTime, EventDate, UserID]
2 rows in set. Query took 0.048 seconds.
```
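One way to narrow this down might be to drop the CTE and the cast expression entirely and check whether a bare min-per-group / order-by-min / limit query picks up the optimization. This is just a sketch (the `min_time` alias and the simplification to raw `EventTime` are mine, not part of the benchmark):

```sql
-- Same top-10-by-min shape, but grouping and aggregating directly on the
-- raw EventTime column with no casts in the projection. If this plan does
-- use the optimization, the cast expression is the likely culprit.
SELECT min("EventTime") AS min_time, "UserID"
FROM 'hits.parquet'
GROUP BY "UserID"
ORDER BY min_time
LIMIT 10;
```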
