Martin Andersson created SEDONA-641: ---------------------------------------
Summary: Add optimizer for non broadcast outer joins Key: SEDONA-641 URL: https://issues.apache.org/jira/browse/SEDONA-641 Project: Apache Sedona Issue Type: Improvement Reporter: Martin Andersson Sedona doesn't optimize non broadcast outer joins. That has some surprising effects for inexperienced Sedona users. * Changing a join from inner to outer makes Spark fall back to a nested loop - grinding to a halt. * If an automatically broadcasted outer join grows above the broadcast threashold the plan changes from a Sedona optimized broadcast join to a Spark nested loop. To work around the lack of outer joins they have to be implemented by the user. Usually by adding a uuid to the datasets, join geospatially with an inner join and then join with the datasets again using an outer join with the uuids. The user implemented outer join obfuscates the business logic. Implementing outer join as efficiently as inner join is hard. It might be worth adding an optimizer for outer joins even if it's not ideal. It could be implemented with uuids, an inner geospatial join and an outer join using the uuids. That would be significantly slower than the inner join optimizer but would be way better than Sparks nested loop. If I remember correctly this is how an early version of geospark did deduplication before the current deduplication logic was implemented. -- This message was sent by Atlassian Jira (v8.20.10#820010)