petern48 commented on issue #2570: URL: https://github.com/apache/sedona/issues/2570#issuecomment-3700841643
Hmm, ok. So we can get the matches easily with the traditional spatial join, but the challenge lies in finding all of the non-matches on the left side in a scalable way (i.e. we can't use the same approach implemented in `BroadcastIndexJoinExec`). Spark will normally try a SortMergeJoin if the join key is linearly sortable, but spatial data is not, so that doesn't work for us; it instead falls back on a `BroadcastNestedLoopJoinExec`.

At a theoretical level, I don't think there is a better way to extend support to outer joins aside from the workaround mentioned in the [PR](https://github.com/apache/sedona/pull/2561): post-filter the results after the spatial inner join (see the sketch at the end of this comment). It would be technically possible to implement that natively in the query optimizer, but two complications come to mind:

1. We can't assume an id already exists in the table (which is needed to determine whether a left row has no matches across all partitions). We could technically generate a [monotonically_increasing_id()](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.monotonically_increasing_id.html) column first, though that has its own cost (the generated id is non-deterministic, so the column generally needs to be materialized before it can be reused safely). I'm also not sure how easy it would be to auto-detect a unique id column from code.
2. This would be a logical plan rewrite driven by the physical plan, since we don't want to modify behavior for cases small enough to use `BroadcastIndexJoinExec`:

   ```
   Generate PhysicalPlan from LogicalPlan (normal behavior)
   If JoinExec is not BroadcastIndexJoinExec,
     Then perform the above logical plan rewrite.
     Then go back and regenerate the physical plan from the new LogicalPlan.
   ```

Anyways, I think that implementation is more complicated than it is worth for now. I didn't think about it too much when I first created the issue. Closing.
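For reference, here is a minimal sketch of what the post-filter workaround could look like on the user side (not the PR's actual implementation). It assumes a Sedona-enabled SparkSession, two DataFrames `left_df` / `right_df` with geometry columns `l_geom` / `r_geom`, and `ST_Intersects` as the join predicate; all of these names are illustrative.

```python
from pyspark.sql import functions as F

# 1. Tag each left row with a surrogate id so non-matching rows can be recovered.
#    monotonically_increasing_id() is non-deterministic, so materialize the
#    tagged DataFrame before reusing it in multiple branches of the plan.
left_tagged = left_df.withColumn("_left_id", F.monotonically_increasing_id()).cache()

# 2. Ordinary spatial INNER join; this is the part Sedona already optimizes.
matched = left_tagged.alias("l").join(
    right_df.alias("r"),
    F.expr("ST_Intersects(l.l_geom, r.r_geom)"),
    "inner",
)

# 3. Recombine on the surrogate id. This is an equi-join on a sortable key, so
#    Spark can use a SortMergeJoin; left rows with no spatial match come back
#    with nulls for the right-side columns, i.e. LEFT OUTER semantics.
matched_right = matched.select("_left_id", *[f"r.{c}" for c in right_df.columns])
left_outer = left_tagged.join(matched_right, on="_left_id", how="left").drop("_left_id")
```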
