clubycoder opened a new pull request, #39911:
URL: https://github.com/apache/spark/pull/39911

   This is an alternative approach to the PRs previously submitted for 
https://issues.apache.org/jira/browse/SPARK-36478
   
   
   ### What changes were proposed in this pull request?
   This PR adds a Join optimizer rule that eliminates a left-outer join 
under a Project when all of the columns the Project needs come from the left 
side of the join.  In that case the join is removed and the left child of the 
Join is placed directly under the Project.
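
   The rewrite described above can be sketched on a toy plan tree. This is a 
minimal illustration in plain Python, not the Scala rule from the patch: the 
`Relation`, `Join`, and `Project` classes and the function 
`eliminate_unused_left_outer_join` are hypothetical names invented for this 
sketch, and columns are modeled as plain strings rather than Catalyst 
attributes. It only checks which columns the Project needs; it does not model 
row counts.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Relation:
    output: Tuple[str, ...]


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"
    join_type: str
    condition: Tuple[str, str]  # (left column, right column) of an equi-join


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: "Plan"


Plan = Union[Relation, Join, Project]


def output_of(plan: Plan) -> Tuple[str, ...]:
    """Columns produced by a plan node."""
    if isinstance(plan, Relation):
        return plan.output
    if isinstance(plan, Project):
        return plan.columns
    return output_of(plan.left) + output_of(plan.right)


def eliminate_unused_left_outer_join(plan: Plan) -> Plan:
    """Bottom-up rewrite: Project over LeftOuter Join -> Project over left child,
    applied only when every projected column comes from the join's left side."""
    if isinstance(plan, Join):
        return Join(
            eliminate_unused_left_outer_join(plan.left),
            eliminate_unused_left_outer_join(plan.right),
            plan.join_type,
            plan.condition,
        )
    if isinstance(plan, Project):
        child = eliminate_unused_left_outer_join(plan.child)
        if (
            isinstance(child, Join)
            and child.join_type == "LeftOuter"
            and set(plan.columns) <= set(output_of(child.left))
        ):
            child = child.left  # drop the join, keep only its left input
        return Project(plan.columns, child)
    return plan


# A plan shaped like a view with two left-outer joins, queried for two columns.
events = Relation(("timestamp", "customerId", "productId"))
customers = Relation(("customerId_customers_customerId",))
products = Relation(("productId_products_productId",))

view = Join(
    Project(
        ("timestamp", "customerId", "productId"),
        Join(events, customers, "LeftOuter",
             ("customerId", "customerId_customers_customerId")),
    ),
    products,
    "LeftOuter",
    ("productId", "productId_products_productId"),
)
optimized = eliminate_unused_left_outer_join(
    Project(("timestamp", "customerId"), view)
)
```

   In this sketch both joins disappear and `optimized` is two nested Projects 
over the `events` relation; collapsing those Projects is left to other rules.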
   
   
   ### Why are the changes needed?
   This optimization targets queries that do not depend on a left-outer join 
applied in an upstream view.  The use case is a group that has standardized 
its analysis around a view that includes one or more left-outer joins to bring 
in additional columns that enhance the core data.  The view has been saved and 
is now a black box for multiple downstream queries, which are required to use 
it for standardization.  Many of the queries on this view do not depend on the 
columns coming from the view's left-outer joins.  If these joined columns are 
not used, we should avoid performing the join and reading the joined dataset.
   
   For example, we would like:
   ```
   == Optimized Logical Plan ==
   Project [timestamp#16L, customerId#17]
   +- Join LeftOuter, (productId#18 = productId_products_productId#64)
      :- Project [timestamp#16L, customerId#17, productId#18]
      :  +- Join LeftOuter, (customerId#17 = customerId_customers_customerId#58)
      :     :- LocalRelation [timestamp#16L, customerId#17, productId#18]
      :     +- LocalRelation [customerId_customers_customerId#58]
      +- LocalRelation [productId_products_productId#64]
   ```
   to become:
   ```
   == Optimized Logical Plan ==
   LocalRelation [timestamp#16L, customerId#17]
   ```
   because the columns that come from the joins are not being used.
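
   The equivalence of the two plans can be checked on toy rows. The sketch 
below is plain Python, not Spark: `left_outer_join` and `project` are 
hypothetical helpers written for this illustration, and the sample rows are 
invented. It assumes each join key matches at most one row on the right side; 
a left-outer join emits a left row once per matching right row, and that case 
is not modeled here.

```python
def left_outer_join(left_rows, right_rows, left_key, right_key):
    """Naive nested-loop left-outer join over lists of dicts."""
    right_columns = {column for row in right_rows for column in row}
    joined = []
    for left in left_rows:
        matches = [r for r in right_rows if r[right_key] == left[left_key]]
        if matches:
            joined.extend({**left, **right} for right in matches)
        else:
            # Unmatched left rows are kept, with NULLs for the right columns.
            joined.append({**left, **{c: None for c in right_columns}})
    return joined


def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]


# Invented sample data; the second event has no matching product.
events = [
    {"timestamp": 1, "customerId": "c1", "productId": "p1"},
    {"timestamp": 2, "customerId": "c2", "productId": "p9"},
]
products = [
    {"productId_products_productId": "p1"},
    {"productId_products_productId": "p2"},
]

wanted = ["timestamp", "customerId"]
with_join = project(
    left_outer_join(events, products,
                    "productId", "productId_products_productId"),
    wanted,
)
without_join = project(events, wanted)
```

   Under the stated assumption, `with_join` and `without_join` hold the same 
rows, which is why the join contributes nothing to this query.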
   
   **NOTE**: The removal of the redundant Project(s) is handled by other 
existing optimizer rules.
   
   ### Does this PR introduce _any_ user-facing change?
   The only user-facing change is the explain output of a query that matches 
the plan optimization.  The optimized plan will have the left-outer join 
removed if the join is not used by the enclosing Project.
   
   
   ### How was this patch tested?
   Added tests to JoinOptimizationSuite covering plans that do and do not 
match the optimization, and also tested manually.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

