GitHub user alexeyegorov edited a comment on the discussion: Sedona LEFT JOIN using SPATIAL INDEX
Hi @michelelaporta and @jiayuasu, after reading this discussion here, I started again trying to somehow enable LEFT JOIN strategy while using Spatial Join in Sedona. ### LEFT JOIN problem Let me explain shortly our problem. First, we are running Sedona on Databricks with Delta Lake (in case, there are any Databricks specific spark issues). Then, we have country-wide datasets comparable to Overture Buildings dataset. Now, to compute some building parameters we want to LEFT JOIN a data set to it which is also country-wide (e.g. parcels). Using broadcast index join is not really working out in the setup: - it is unfeasable to broadcast one of these datasets - building an spatial index on one of the datasets is also kind of unfeasable ### How we solved this issue so far? Databricks and Sedona suggest to use grid systems like [H3](https://www.databricks.com/blog/2019/12/05/processing-geospatial-data-at-scale-with-databricks.html) or [S2](https://sedona.apache.org/latest/api/sql/Optimizer/#google-s2-based-approximate-equi-join). We applied this approach: - create H3 cells - explode the resulting array - perform left join using cell id and then applying WHERE filter using ST_INTERSECTS - deduplicate the result (based on multiple cells per polygon possible) This approach works, however it requires to create a way bigger data file in-between, the code complexity increases, etc. ### Range Join as a solution I spent yesterday several hours and tested different approaches starting with the INNER JOIN that is using range join by Sedona. This as a starting point is already performing way better then using an intermediate grid creation step. However, we need a LEFT JOIN. Thanks to several suggestions by Copilot, I found a solution to achieve exactly what we need. Here is a sketch: ```sql WITH inner_join AS ( SELECT buildings.building_id , parcels.parcel_id FROM buildings INNER JOIN parcels ON ST_INTERSECTS(buildings.building_geometry, parcels.parcels_geometry) ) , left_join AS ( SELECT buildings.* , COALESCE(inner_join.parcel_id, NULL) AS parcel_id , COALESCE(inner_join.parcels_geometry, NULL) AS parcels_geometry FROM buildings FULL OUTER JOIN inner_join ON buildings.building_id = inner_join.building_id ) SELECT * FROM left_join ``` 1. First CTE performs an inner-join to retrieve all the matches. As this uses RangeJoin strategy, the performance is great. 2. Second CTE extends the original `buildings` table by adding the matched `parcels` entries where applicable. I ran this setup a few times with a warmed-up cluster (8 nodes with 4 cores). The previous grid solution took me (only the join of the exploded H3 tables and its deduplication) around 15 min. This above solution doesn't require the deduplication and H3 grid steps. Without even considering the H3 cell explosion (writing files to disk etc.) the JOIN took me around 7 min. ### Summary I wanted to share this solution to get some feedback and allow others to see that. I am sure I am not the first one to run into it. As a side note: the intermediate H3 dataset required more than 100GB of storage, while the original is somewhere around 10-15GB. Meaning, this saves storage costs and also read/write IO. As we are deploying Sedona almost completely using DBT, we can put this solution as a Jinja macro and reuse it with one line of code. Otherwise this could be exported as function. @jiayuasu would it be worth to extend the documentation with this solution? I am not sure if Sedona would otherwise be able to use any spatial join to perform the LEFT JOIN in the near future. If this solution is good, I can try to submit a PR for documentation update. In this case, I would be happy if you could hint me, where to start with that. ;) GitHub link: https://github.com/apache/sedona/discussions/2520#discussioncomment-15231076 ---- This is an automatically sent email for [email protected]. To unsubscribe, please send an email to: [email protected]
