(sedona) 03/03: [DOCS] Add spatial left join docs solution (#2561)

jiayu Tue, 16 Dec 2025 12:43:13 -0800

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch branch-1.8.0
in repository https://gitbox.apache.org/repos/asf/sedona.git


commit e428c6dfb38e910b597bb8ba9f9ae0c8e095b3db
Author: Alexey <[email protected]>
AuthorDate: Tue Dec 16 19:44:23 2025 +0100

    [DOCS] Add spatial left join docs solution (#2561)
---
 docs/api/sql/Optimizer.md | 45 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/docs/api/sql/Optimizer.md b/docs/api/sql/Optimizer.md
index aa74e2b4bc..ca14054b1e 100644
--- a/docs/api/sql/Optimizer.md
+++ b/docs/api/sql/Optimizer.md
@@ -132,7 +132,7 @@ DistanceJoin pointshape1#12: geometry, pointshape2#33: 
geometry, 2.0, true
 
 Spark SQL Example for meter-based geodesic distance `ST_DistanceSpheroid` 
(works for `ST_DistanceSphere` too):
 
-*Less than a certain distance==*
+*==Less than a certain distance==*
 
 ```sql
 SELECT *
@@ -140,7 +140,7 @@ FROM pointdf1, pointdf2
 WHERE ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) < 2
 ```
 
-*Less than or equal to a certain distance==*
+*==Less than or equal to a certain distance==*
 
 ```sql
 SELECT *
@@ -151,6 +151,47 @@ WHERE 
ST_DistanceSpheroid(pointdf1.pointshape1,pointdf2.pointshape2) <= 2
 !!!warning
        If you use `ST_DistanceSpheroid` or `ST_DistanceSphere` as the 
predicate, the unit of the distance is meter. Currently, distance join with 
geodesic distance calculators work best for point data. For non-point data, it 
only considers their centroids.
 
+## Spatial Left Join
+
+Introduction: Perform a left join using the spatial performance of a range or 
distance join.
+This allows to find geometries from A and B matching the join criteria while 
also keeping those entries from A that do not match any geometry in B.
+
+Range and distance joins ==do not support== a LEFT JOIN as below:
+
+```sql
+SELECT a.*, b.* FROM a
+LEFT JOIN b ON ST_INTERSECTS(a.geometry, b.geometry)
+```
+
+This will lead to a **BroadcastIndexJoin** which can become very inefficient 
with two large datasets.
+Otherwise, the **BroadcastNestedLoopJoin** is triggered which is the slowest 
option.
+
+In order to make use of Sedona's spatial join performance, it is possible to 
produce the result of a left join by combining an INNER JOIN with a LEFT JOIN.
+
+1. With the inner join, we collect the ID from the left side and all the 
required columns from the right side (consider the result as **A'**)
+2. In the second step, we combine the left side A with the result of the inner 
join **A'**.
+   All the entries of A are kept as they are, while the entries of right side 
B are forwarded through **A'**.
+
+```sql
+WITH inner_join AS (
+    SELECT
+        dfA.a_id
+        , dfB.b_id
+    FROM dfA, dfB
+    WHERE ST_INTERSECTS(dfA.geometry, dfB.geometry)
+)
+
+SELECT
+    dfA.*,
+    inner_join.b_id
+FROM dfA
+LEFT JOIN inner_join
+   ON dfA.a_id = inner_join.a_id;
+```
+
+!!!note
+       One can define this strategy as stored procedure or a DBT macro to 
simplify the repeated code.
+
 ## Broadcast index join
 
 Introduction: Perform a range join or distance join but broadcast one of the 
sides of the join. This maintains the partitioning of the non-broadcast side 
and doesn't require a shuffle.

(sedona) 03/03: [DOCS] Add spatial left join docs solution (#2561)

Reply via email to