Kontinuation opened a new pull request, #91:
URL: https://github.com/apache/sedona-db/pull/91

   ## Background
   
   We found that range queries like the following one run very slowly using 
SedonaDB:
   
   ```sql
   SELECT COUNT(*) AS trip_count_in_coconino_county
   FROM trip t
   WHERE ST_Intersects(ST_GeomFromWKB(t.t_pickuploc), (SELECT 
ST_GeomFromWKB(z.z_boundary) FROM zone z WHERE z.z_name = 'Coconino County' 
LIMIT 1));
   ```
   
   This is a range query, where the query window is computed using a subquery 
`SELECT ST_GeomFromWKB(z.z_boundary) FROM zone z WHERE z.z_name = 'Coconino 
County' LIMIT 1`. This query is planned as a NestedLoopJoin followed by a 
filter:
   
   ```
   
┌───────────────┬───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
   │   plan_type   ┆                                                            
                                                                plan            
                                                                                
                               │
   │      utf8     ┆                                                            
                                                                utf8            
                                                                                
                               │
   
╞═══════════════╪═══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
   │ logical_plan  ┆ Projection: count(Int64(1)) AS count(*) AS 
trip_count_in_coconino_county                                                   
                                                                                
                                               │
   │               ┆   Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]        
                                                                                
                                                                                
                               │
   │               ┆     Projection:                                            
                                                                                
                                                                                
                               │
   │               ┆       Filter: st_intersects(st_geomfromwkb(t.t_pickuploc), 
__scalar_sq_1.st_geomfromwkb(z.z_boundary))                                     
                                                                                
                               │
   │               ┆         Left Join:                                         
                                                                                
                                                                                
                               │
   │               ┆           SubqueryAlias: t                                 
                                                                                
                                                                                
                               │
   │               ┆             SubqueryAlias: trip                            
                                                                                
                                                                                
                               │
   │               ┆               TableScan: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/*.parquet
 projection=[t_pickuploc]                                                       
                                                       │
   │               ┆           SubqueryAlias: __scalar_sq_1                     
                                                                                
                                                                                
                               │
   │               ┆             Projection: st_geomfromwkb(z.z_boundary)       
                                                                                
                                                                                
                               │
   │               ┆               SubqueryAlias: z                             
                                                                                
                                                                                
                               │
   │               ┆                 SubqueryAlias: zone                        
                                                                                
                                                                                
                               │
   │               ┆                   Projection: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet.z_boundary
                                                                                
                                       │
   │               ┆                     Limit: skip=0, fetch=1                 
                                                                                
                                                                                
                               │
   │               ┆                       Filter: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet.z_name
 = Utf8View("Coconino County")                                                  
                                           │
   │               ┆                         TableScan: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet
 projection=[z_name, z_boundary], 
partial_filters=[/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parqu
 │
   │               ┆ et/zone/*.parquet.z_name = Utf8View("Coconino County")]    
                                                                                
                                                                                
                               │
   
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   │ physical_plan ┆ ProjectionExec: expr=[count(Int64(1))@0 as 
trip_count_in_coconino_county]                                                  
                                                                                
                                               │
   │               ┆   AggregateExec: mode=Final, gby=[], 
aggr=[count(Int64(1))]                                                          
                                                                                
                                                     │
   │               ┆     CoalescePartitionsExec                                 
                                                                                
                                                                                
                               │
   │               ┆       AggregateExec: mode=Partial, gby=[], 
aggr=[count(Int64(1))]                                                          
                                                                                
                                               │
   │               ┆         ProjectionExec: expr=[]                            
                                                                                
                                                                                
                               │
   │               ┆           CoalesceBatchesExec: target_batch_size=8192      
                                                                                
                                                                                
                               │
   │               ┆             FilterExec: 
st_intersects(st_geomfromwkb(t_pickuploc@0), st_geomfromwkb(z.z_boundary)@1)    
                                                                                
                                                                  │
   │               ┆               ProjectionExec: expr=[t_pickuploc@1 as 
t_pickuploc, st_geomfromwkb(z.z_boundary)@0 as st_geomfromwkb(z.z_boundary)]    
                                                                                
                                     │
   │               ┆                 NestedLoopJoinExec: join_type=Right        
                                                                                
                                                                                
                               │
   │               ┆                   CoalescePartitionsExec                   
                                                                                
                                                                                
                               │
   │               ┆                     ProjectionExec: 
expr=[st_geomfromwkb(z_boundary@0) as st_geomfromwkb(z.z_boundary)]             
                                                                                
                                                      │
   │               ┆                       RepartitionExec: 
partitioning=RoundRobinBatch(10), input_partitions=1                            
                                                                                
                                                   │
   │               ┆                         CoalescePartitionsExec: fetch=1    
                                                                                
                                                                                
                               │
   │               ┆                           ProjectionExec: 
expr=[z_boundary@1 as z_boundary]                                               
                                                                                
                                                │
   │               ┆                             CoalesceBatchesExec: 
target_batch_size=8192, fetch=1                                                 
                                                                                
                                         │
   │               ┆                               FilterExec: z_name@0 = 
Coconino County                                                                 
                                                                                
                                     │
   │               ┆                                 DataSourceExec: 
file_groups={10 groups: 
[[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:0..141525345],
 [Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_form │
   │               ┆ at=parquet/zone/part-1.parquet:141525345..283050690], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:283050690..424576035],
 [Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/ │
   │               ┆ zone/part-1.parquet:424576035..566101380], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:566101380..707626725],
 ...]}, projection=[z_name, z_boundary], file_type=parquet                      
         │
   │               ┆                   DataSourceExec: file_groups={10 groups: 
[[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:0..36478017],
 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip
 │
   │               ┆ /part-1.parquet:36478017..72956034], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:72956034..109434051],
 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parque
 │
   │               ┆ t:109434051..145912068], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:145912068..182390085],
 ...]}, projection=[t_pickuploc], file_type=parquet                             
                           │
   │               ┆                                                            
                                                                                
                                                                                
                               │
   
└───────────────┴───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   ```
   
   The problem with this approach is that we cannot make use of the fact that 
the query window is a constant. We'll end up with repeatedly decoding WKB and 
constructing geometry objects for the query window. This is a waste of work 
especially when the query window is a very complex shape.
   
   ## Changes made by this PR
   
   This PR optimizes the logical plan of the above query to merge the Filter 
and Join operators as one Join operator with filter (theta-join). This enables 
the physical optimization to convert the NestedLoopJoin with spatial filter 
into a SpatialJoin.
   
   ```
   
┌───────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
   │   plan_type   ┆                                                            
                           plan                                                 
                                     │
   │      utf8     ┆                                                            
                           utf8                                                 
                                     │
   
╞═══════════════╪═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
   │ logical_plan  ┆ Projection: count(Int64(1)) AS count(*) AS 
trip_count_in_coconino_county                                                   
                                                     │
   │               ┆   Aggregate: groupBy=[[]], aggr=[[count(Int64(1))]]        
                                                                                
                                     │
   │               ┆     Projection:                                            
                                                                                
                                     │
   │               ┆       Inner Join:  Filter: 
st_intersects(st_geomfromwkb(t.t_pickuploc), 
__scalar_sq_1.st_geomfromwkb(z.z_boundary))                                     
                        │
   │               ┆         SubqueryAlias: t                                   
                                                                                
                                     │
   │               ┆           SubqueryAlias: trip                              
                                                                                
                                     │
   │               ┆             TableScan: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/*.parquet
 projection=[t_pickuploc]                                      │
   │               ┆         SubqueryAlias: __scalar_sq_1                       
                                                                                
                                     │
   │               ┆           Projection: st_geomfromwkb(z.z_boundary)         
                                                                                
                                     │
   │               ┆             SubqueryAlias: z                               
                                                                                
                                     │
   │               ┆               SubqueryAlias: zone                          
                                                                                
                                     │
   │               ┆                 Projection: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet.z_boundary
                                               │
   │               ┆                   Limit: skip=0, fetch=1                   
                                                                                
                                     │
   │               ┆                     Filter: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet.z_name
 = Utf8View("Coconino County")                     │
   │               ┆                       TableScan: 
/Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet
 projection=[z_name, z_boundary], partial_filters=[/ │
   │               ┆ 
Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/*.parquet.z_name
 = Utf8View("Coconino County")]                                                 
│
   
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
   │ physical_plan ┆ ProjectionExec: expr=[count(Int64(1))@0 as 
trip_count_in_coconino_county]                                                  
                                                     │
   │               ┆   AggregateExec: mode=Final, gby=[], 
aggr=[count(Int64(1))]                                                          
                                                           │
   │               ┆     CoalescePartitionsExec                                 
                                                                                
                                     │
   │               ┆       AggregateExec: mode=Partial, gby=[], 
aggr=[count(Int64(1))]                                                          
                                                     │
   │               ┆         ProjectionExec: expr=[]                            
                                                                                
                                     │
   │               ┆           SpatialJoinExec: join_type=Inner, 
on=ST_intersects(st_geomfromwkb(z.z_boundary)@0, st_geomfromwkb(t_pickuploc@0)) 
                                                    │
   │               ┆             ProjectionExec: 
expr=[st_geomfromwkb(z_boundary@0) as st_geomfromwkb(z.z_boundary)]             
                                                                    │
   │               ┆               RepartitionExec: 
partitioning=RoundRobinBatch(10), input_partitions=1                            
                                                                 │
   │               ┆                 CoalescePartitionsExec: fetch=1            
                                                                                
                                     │
   │               ┆                   ProjectionExec: expr=[z_boundary@1 as 
z_boundary]                                                                     
                                        │
   │               ┆                     CoalesceBatchesExec: 
target_batch_size=8192, fetch=1                                                 
                                                       │
   │               ┆                       FilterExec: z_name@0 = Coconino 
County                                                                          
                                          │
   │               ┆                         DataSourceExec: file_groups={10 
groups: 
[[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:0..141525345],
 │
   │               ┆  
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:141525345..283050690],
 [Users/bopeng/workspace/data/spatial-bench/SpatialBench │
   │               ┆ 
_sf=1_format=parquet/zone/part-1.parquet:283050690..424576035], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:424576035..5661
 │
   │               ┆ 01380], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/zone/part-1.parquet:566101380..707626725],
 ...]}, projection=[z_name, z_boundary], file_typ │
   │               ┆ e=parquet                                                  
                                                                                
                                     │
   │               ┆             DataSourceExec: file_groups={10 groups: 
[[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:0..36478017],
 [Users/bopen │
   │               ┆ 
g/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:36478017..72956034],
 [Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=pa │
   │               ┆ rquet/trip/part-1.parquet:72956034..109434051], 
[Users/bopeng/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:109434051..145912068],
 [Users/b │
   │               ┆ 
openg/workspace/data/spatial-bench/SpatialBench_sf=1_format=parquet/trip/part-1.parquet:145912068..182390085],
 ...]}, projection=[t_pickuploc], file_type=parquet               │
   │               ┆                                                            
                                                                                
                                     │
   
└───────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
   ```
   
   ## Performance Improvements
   
   The query mentioned in the background section took 21 seconds before 
optimization, and took 0.4 seconds after this optimization.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@sedona.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to