2010YOUY01 commented on PR #65:
URL: https://github.com/apache/sedona-db/pull/65#issuecomment-3318061230

   It seems that even for the largest dataset (2000 buildings x 1000 trips), Sedona 
only uses 1 CPU core during the `SpatialJoinExec` phase. If you try a 
larger dataset on the join probe (trips) side, the speedup relative to 
DuckDB/pg should be even higher. (This is really impressive work!)
   
   I tried two data sizes:
   - 2000 buildings x 1000 trips: 25ms
   - 2000 buildings x 20000 trips: 50ms
   
   ## reproducer in `sedona-cli`:
   ```sql
   -- Setup
   CREATE OR REPLACE VIEW knn_buildings AS
   SELECT
       geometry AS geom,
       ROUND(random() * 1000) AS building_id,
       'Building_' || CAST(ROUND(random() * 1000) AS VARCHAR) AS name
   FROM sd_random_geometry('{
       "geom_type": "Polygon",
       "target_rows": 2000,
       "vertices_per_linestring_range": [4, 8],
       "size_range": [0.001, 0.01],
       "seed": 42
   }');
   
   CREATE OR REPLACE VIEW knn_trips AS
   SELECT
       geometry AS geom,
       ROUND(random() * 100000) AS trip_id
   FROM sd_random_geometry('{
       "geom_type": "Point",
       "target_rows": 100000,
       "seed": 43
   }');
   
   
   
   -- Benchmark Query
   WITH trip_sample AS (
       SELECT trip_id, geom AS trip_geom
       FROM knn_trips
       LIMIT 20000                               -- <== This line is changed
   ),
   building_with_geom AS (
       SELECT building_id, name, geom AS building_geom
       FROM knn_buildings
   )
   SELECT
       t.trip_id,
       b.building_id,
       b.name,
       ST_Distance(t.trip_geom, b.building_geom) AS distance
   FROM trip_sample t
   JOIN building_with_geom b ON ST_KNN(t.trip_geom, b.building_geom, 5, FALSE)
   ORDER BY t.trip_id, distance;
   ```
   
   ## Reasons
   The data source `sd_random_geometry` seems to produce batches of 1k rows, 
and the execution layer uses the number of CPU cores as the default degree of 
parallelism. If there is only 1 batch, only one partition does any work. My 
machine has 14 cores, so with 20000 rows on the probe side, all partitions 
stay busy, with roughly the same amount of work per partition as the single 
partition had before.
   
   This can be verified by running `EXPLAIN ANALYZE VERBOSE` on the query and 
checking the `output_rows` field inside `SpatialJoinExec`.
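   For example, a minimal sketch of the verification step, assuming 
`EXPLAIN ANALYZE VERBOSE` accepts the benchmark query unchanged (exact metric 
names in the plan output may vary by version):
   ```sql
   -- Prefix the benchmark query to get per-operator runtime metrics;
   -- look for `output_rows` under `SpatialJoinExec` to see how many
   -- rows the join actually produced per partition.
   EXPLAIN ANALYZE VERBOSE
   SELECT
       t.trip_id,
       b.building_id,
       ST_Distance(t.geom, b.geom) AS distance
   FROM (SELECT trip_id, geom FROM knn_trips LIMIT 20000) t
   JOIN knn_buildings b ON ST_KNN(t.geom, b.geom, 5, FALSE)
   ORDER BY t.trip_id, distance;
   ```
   If only one partition shows a non-zero `output_rows`, the join is running 
on a single core.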
