petern48 commented on issue #2570: URL: https://github.com/apache/sedona/issues/2570#issuecomment-3700841643
Hmm, ok. So we can get the matches easily with the traditional spatial join, but the challenge lies in finding all of the non-matches on the left side in a scalable way (i.e. we can't use the same approach implemented in `BroadcastIndexJoinExec`). Spark will normally try a SortMergeJoin if the join key is linearly sortable, but spatial data is not, so that doesn't work for us; it instead falls back on a `BroadcastNestedLoopJoinExec`.

At a theoretical level, I don't think there is a better way to extend support to outer joins aside from the workaround mentioned in the [PR](https://github.com/apache/sedona/pull/2561): post-filter the results after the spatial inner join (see the sketch at the end of this comment). It would be technically possible to implement that natively in the query optimizer, but two complications come to mind:

1. We can't assume an id already exists in the table (which is needed to determine whether a left row has no matches across all partitions). We could technically generate a [monotonically_increasing_id()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.monotonically_increasing_id.html) column first, though that has its own cost (the generated id is non-deterministic, so the column generally needs to be materialized before it can be reused safely). I'm also not sure how easy it would be to auto-detect a unique id column from code.
2. This would be a logical plan rewrite driven by the physical plan, since we don't want to modify behavior for cases small enough to use `BroadcastIndexJoinExec`:

   ```
   Generate PhysicalPlan from LogicalPlan (normal behavior)
   If JoinExec is not BroadcastIndexJoinExec,
     Then perform the above logical plan rewrite.
     Then go back and regenerate the physical plan from the new LogicalPlan.
   ```

Anyways, I think that implementation is more complicated than it is worth for now. I didn't think about it too much when I first created the issue. Closing.
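For reference, here is a minimal sketch of what the post-filter workaround could look like on the user side (not the PR's actual implementation). It assumes a Sedona-enabled SparkSession, two DataFrames `left_df` / `right_df` with geometry columns `l_geom` / `r_geom`, and `ST_Intersects` as the join predicate; all of these names are illustrative.

```python
from pyspark.sql import functions as F

# 1. Tag each left row with a surrogate id so non-matching rows can be recovered.
#    monotonically_increasing_id() is non-deterministic, so materialize the
#    tagged DataFrame before reusing it in multiple branches of the plan.
left_tagged = left_df.withColumn("_left_id", F.monotonically_increasing_id()).cache()

# 2. Ordinary spatial INNER join; this is the part Sedona already optimizes.
matched = left_tagged.alias("l").join(
    right_df.alias("r"),
    F.expr("ST_Intersects(l.l_geom, r.r_geom)"),
    "inner",
)

# 3. Recombine on the surrogate id. This is an equi-join on a sortable key, so
#    Spark can use a SortMergeJoin; left rows with no spatial match come back
#    with nulls for the right-side columns, i.e. LEFT OUTER semantics.
matched_right = matched.select("_left_id", *[f"r.{c}" for c in right_df.columns])
left_outer = left_tagged.join(matched_right, on="_left_id", how="left").drop("_left_id")
```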
