2010YOUY01 commented on PR #65:
URL: https://github.com/apache/sedona-db/pull/65#issuecomment-3318061230
It seems even for the largest dataset (2000 buildings x 1000 trips), sedona
is only using 1 CPU core for the `SpatialJoinExec` phase. I think if you try a
larger dataset at the join probe (trips) side, the speedup relative to
DuckDB/pg can be higher. (That's a really impressive work!)
I tried two data sizes:
2000 buildings x 1000 trips: 25ms
2000 buildings x 20000 trips: 50ms
## reproducer in `sedona-cli`:
```sql
-- Setup
CREATE OR REPLACE VIEW knn_buildings AS
SELECT
geometry AS geom,
ROUND(random() * 1000) AS building_id,
'Building_' || CAST(ROUND(random() * 1000) AS VARCHAR) AS name
FROM sd_random_geometry('{
"geom_type": "Polygon",
"target_rows": 2000,
"vertices_per_linestring_range": [4, 8],
"size_range": [0.001, 0.01],
"seed": 42
}');
CREATE OR REPLACE VIEW knn_trips AS
SELECT
geometry AS geom,
ROUND(random() * 100000) AS trip_id
FROM sd_random_geometry('{
"geom_type": "Point",
"target_rows": 100000,
"seed": 43
}');
-- Benchmark Query
WITH trip_sample AS (
SELECT trip_id, geom AS trip_geom
FROM knn_trips
LIMIT 20000 -- <== This line is changed
),
building_with_geom AS (
SELECT building_id, name, geom AS building_geom
FROM knn_buildings
)
SELECT
t.trip_id,
b.building_id,
b.name,
ST_Distance(t.trip_geom, b.building_geom) AS distance
FROM trip_sample t
JOIN building_with_geom b ON ST_KNN(t.trip_geom, b.building_geom, 5, FALSE)
ORDER BY t.trip_id, distance;
```
## Reasons
The data source `sd_random_geometry` seems to produce batches in 1k size,
and the execution layer is using the number of CPU cores as the parallelism
degree by default. If there is only 1 batch, only one partition is working; my
machine has 14 cores, so if I try 20000 as the probe side size, all partitions
can be kept busy with a similar amount of work to do per partition as before.
It can be verified through `explain analyze verbose + query`, and see
`output_rows` field inside `SpatialJoinExec`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]