GitHub user alexeyegorov added a comment to the discussion: Sedona LEFT JOIN 
using SPATIAL INDEX

Hi @michelelaporta and @jiayuasu,

after reading this discussion here, I started again trying to somehow enable 
LEFT JOIN strategy while using Spatial Join in Sedona.

### LEFT JOIN problem
Let me explain shortly our problem. First, we are running Sedona on Databricks 
with Delta Lake (in case, there are any Databricks specific spark issues). 
Then, we have country-wide datasets comparable to Overture Buildings dataset. 
Now, to compute some building parameters we want to LEFT JOIN a data set to it 
which is also country-wide (e.g. parcels). Using broadcast index join is not 
really working out in the setup:

- it is unfeasable to broadcast one of these datasets
- building an spatial index on one of the datasets is also kind of unfeasable

### How we solved this issue so far?
Databricks and Sedona suggest to use grid systems like 
[H3](https://www.databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html)
 or 
[S2](https://sedona.apache.org/latest/api/sql/Optimizer/#google-s2-based-approximate-equi-join).
 We applied this approach:

- create H3 cells
- explode the resulting array
- perform left join using cell id and then applying WHERE filter using 
ST_INTERSECTS 
- deduplicate the result (based on multiple cells per polygon possible)

This approach works, however it requires to create a way bigger data file 
in-between, the code complexity increases, etc.

### Range Join as a solution

I spent yesterday several hours and tested different approaches starting with 
the INNER JOIN that is using range join by Sedona. This as a starting point is 
already performing way better then using an intermediate grid creation step. 
However, we need a LEFT JOIN.

Thanks to several suggestions by Copilot, I found a solution to achieve exactly 
what we need. Here is a sketch:

```sql
WITH inner_join AS (
    SELECT
        buildings.building_id
        , parcels.parcel_id
    FROM buildings
    INNER JOIN parcels
        ON ST_INTERSECTS(buildings.building_geometry, parcels.parcels_geometry)
)

, left_join AS (
    SELECT
        buildings.*
        , COALESCE(inner_join.parcel_id, NULL) AS parcel_id
        , COALESCE(inner_join.parcels_geometry, NULL) AS parcels_geometry
    FROM buildings
    FULL OUTER JOIN inner_join ON buildings.building_id = inner_join.building_id
)

SELECT * FROM left_join
```

1. First CTE performs an inner-join to retrieve all the matches. As this uses 
RangeJoin strategy, the performance is great.
2. Second CTE extends the original `buildings` table by adding the matched 
`parcels` entries where applicable. 

I ran this setup a few times with a warmed-up cluster (8 nodes with 4 cores). 
The previous grid solution took me (only the join of the exploded H3 tables and 
its deduplication) around 15 min. This above solution doesn't require the 
deduplication and H3 grid steps. Without even considering the H3 cell explosion 
(writing files to disk etc.) the JOIN took me around 7 min.

### Summary

I wanted to share this solution to get some feedback and allow others to see 
that. I am sure I am not the first one to run into it. 
As a side note: the intermediate H3 dataset required more than 100GB of 
storage, while the original is somewhere around 10-15GB. Meaning, this saves 
storage costs and also read/write IO.

@jiayuasu would it be worth to extend the documentation with this solution? I 
am not sure if Sedona would otherwise be able to use any spatial join to 
perform the LEFT JOIN in the near future. If this solution is good, I can try 
to submit a PR for documentation update. In this case, I would be happy if you 
could hint me, where to start with that. ;) 

GitHub link: 
https://github.com/apache/sedona/discussions/2520#discussioncomment-15231076

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to