Martin Andersson created SEDONA-641:
---------------------------------------

             Summary: Add optimizer for non broadcast outer joins
                 Key: SEDONA-641
                 URL: https://issues.apache.org/jira/browse/SEDONA-641
             Project: Apache Sedona
          Issue Type: Improvement
            Reporter: Martin Andersson


Sedona doesn't optimize non broadcast outer joins. That has some surprising 
effects for inexperienced Sedona users.
 
* Changing a join from inner to outer makes Spark fall back to a nested loop - 
grinding to a halt.
* If an automatically broadcasted outer join grows above the broadcast 
threashold the plan changes from a Sedona optimized broadcast join to a Spark 
nested loop.
 
To work around the lack of outer joins they have to be implemented by the user. 
Usually by adding a uuid to the datasets, join geospatially with an inner join 
and then join with the datasets again using an outer join with the uuids. The 
user implemented outer join obfuscates the business logic.
 
Implementing outer join as efficiently as inner join is hard. It might be worth 
adding an optimizer for outer joins even if it's not ideal. It could be 
implemented with uuids, an inner geospatial join and an outer join using the 
uuids. That would be significantly slower than the inner join optimizer but 
would be way better than Sparks nested loop. If I remember correctly this is 
how an early version of geospark did deduplication before the current 
deduplication logic was implemented.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to