Re: [D] Sedona LEFT JOIN using SPATIAL INDEX [sedona]

via GitHub Thu, 11 Dec 2025 05:09:58 -0800


GitHub user alexeyegorov added a comment to the discussion: Sedona LEFT JOIN 
using SPATIAL INDEX

Hi @michelelaporta and @jiayuasu,

after reading this discussion here, I started again trying to somehow enable
LEFT JOIN strategy while using Spatial Join in Sedona.

### LEFT JOIN problem
Let me explain shortly our problem. First, we are running Sedona on Databricks
with Delta Lake (in case, there are any Databricks specific spark issues).
Then, we have country-wide datasets comparable to Overture Buildings dataset.
Now, to compute some building parameters we want to LEFT JOIN a data set to it
which is also country-wide (e.g. parcels). Using broadcast index join is not
really working out in the setup:

- it is unfeasable to broadcast one of these datasets
- building an spatial index on one of the datasets is also kind of unfeasable

### How we solved this issue so far?
Databricks and Sedona suggest to use grid systems like
[H3](https://www.databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html)
or
[S2](https://sedona.apache.org/latest/api/sql/Optimizer/#google-s2-based-approximate-equi-join).
We applied this approach:

- create H3 cells
- explode the resulting array
- perform left join using cell id and then applying WHERE filter using
ST_INTERSECTS
- deduplicate the result (based on multiple cells per polygon possible)

This approach works, however it requires to create a way bigger data file
in-between, the code complexity increases, etc.

### Range Join as a solution

I spent yesterday several hours and tested different approaches starting with
the INNER JOIN that is using range join by Sedona. This as a starting point is
already performing way better then using an intermediate grid creation step.
However, we need a LEFT JOIN.

Thanks to several suggestions by Copilot, I found a solution to achieve exactly
what we need. Here is a sketch:

```sql
WITH inner_join AS (
SELECT
buildings.building_id
, parcels.parcel_id
FROM buildings
INNER JOIN parcels
ON ST_INTERSECTS(buildings.building_geometry, parcels.parcels_geometry)
)

, left_join AS (
SELECT
buildings.*
, COALESCE(inner_join.parcel_id, NULL) AS parcel_id
, COALESCE(inner_join.parcels_geometry, NULL) AS parcels_geometry
FROM buildings
FULL OUTER JOIN inner_join ON buildings.building_id = inner_join.building_id
)

SELECT * FROM left_join
```

1. First CTE performs an inner-join to retrieve all the matches. As this uses
RangeJoin strategy, the performance is great.
2. Second CTE extends the original `buildings` table by adding the matched
`parcels` entries where applicable.

I ran this setup a few times with a warmed-up cluster (8 nodes with 4 cores).
The previous grid solution took me (only the join of the exploded H3 tables and
its deduplication) around 15 min. This above solution doesn't require the
deduplication and H3 grid steps. Without even considering the H3 cell explosion
(writing files to disk etc.) the JOIN took me around 7 min.

### Summary

I wanted to share this solution to get some feedback and allow others to see
that. I am sure I am not the first one to run into it.
As a side note: the intermediate H3 dataset required more than 100GB of
storage, while the original is somewhere around 10-15GB. Meaning, this saves
storage costs and also read/write IO.

@jiayuasu would it be worth to extend the documentation with this solution? I
am not sure if Sedona would otherwise be able to use any spatial join to
perform the LEFT JOIN in the near future. If this solution is good, I can try
to submit a PR for documentation update. In this case, I would be happy if you
could hint me, where to start with that. ;)

GitHub link:
https://github.com/apache/sedona/discussions/2520#discussioncomment-15231076

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Re: [D] Sedona LEFT JOIN using SPATIAL INDEX [sedona]

Reply via email to