korowa opened a new pull request, #2591:
URL: https://github.com/apache/arrow-datafusion/pull/2591

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and 
enhancements and this helps us generate change logs for our releases. You can 
link an issue to this PR using the GitHub syntax. For example `Closes #123` 
indicates that this PR will close issue #123.
   -->
   
   Closes #2509, #2496.
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in 
the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your 
changes and offer better suggestions for fixes.  
   -->
   
   Support filters in the `JOIN ON` SQL clause.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is 
sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   **Logical plan**
   
   The join logical plan now carries an `Option<Expr>` filter, which is applied 
to the "equijoined" data. The join planning logic is left almost untouched:
   
   - `Inner` is still planned as Join -> Filter (this allows proper filter pushdown)
   - for `Left` / `Right`, the planner still pushes down predicates that relate 
only to the inner join input, and it now also allows predicates based on the outer input
   - `Full` allows predicates in the ON clause
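
   The difference between the `Inner` and outer cases can be sketched with a 
small standalone example. This is only an illustration of the semantics, not 
code from this PR: the function name `left_join_with_filter` and the 
`(key, value)` tuple rows are hypothetical. It shows why a filter in the ON 
clause of a left join cannot simply be planned as Join -> Filter: a left row 
whose candidates all fail the filter must still be emitted, padded with a null 
right side.

```rust
/// Hypothetical row layout for this sketch: (join key, value).
pub type Row = (i64, i64);

/// Naive left join where `filter` is part of the ON condition.
/// A left row with no right row passing both the key equality and the
/// filter is still emitted once, with `None` standing in for NULLs.
pub fn left_join_with_filter<F>(
    left: &[Row],
    right: &[Row],
    filter: F,
) -> Vec<(Row, Option<Row>)>
where
    F: Fn(&Row, &Row) -> bool,
{
    let mut out = Vec::new();
    for l in left {
        let mut matched = false;
        for r in right {
            // Equijoin condition first, then the extra ON filter.
            if l.0 == r.0 && filter(l, r) {
                out.push((*l, Some(*r)));
                matched = true;
            }
        }
        if !matched {
            // Outer side is preserved even though the filter rejected
            // every candidate; a post-join Filter node would drop it.
            out.push((*l, None));
        }
    }
    out
}
```

   For an inner join there is no such padding step, so evaluating the filter 
after the join gives the same rows.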
   
   **Physical plan**
   
   After building the left/right index vectors from the equijoin part 
of the `ON` clause, HashJoin now applies the filter expression (if one has been 
provided) to a batch of the rows at those indices and produces new vectors 
holding the indices of the joined rows that pass the filter. The intermediate 
batch contains only the columns required by the filter expression.
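
   A minimal sketch of these two steps, using plain structs and vectors in 
place of Arrow batches. Everything here is an assumption made for 
illustration: `LeftRow`, `RightRow`, `equijoin_indices` and 
`apply_join_filter` are hypothetical names, and the real implementation 
evaluates a physical expression over a record batch rather than calling a 
closure per row.

```rust
use std::collections::HashMap;

/// Hypothetical row types standing in for the two join inputs.
#[derive(Debug, Clone, PartialEq)]
pub struct LeftRow { pub id: i64, pub amount: i64 }
#[derive(Debug, Clone, PartialEq)]
pub struct RightRow { pub id: i64, pub limit: i64 }

/// Step 1: equijoin on `id`, returning parallel vectors of row indices.
pub fn equijoin_indices(
    left: &[LeftRow],
    right: &[RightRow],
) -> (Vec<usize>, Vec<usize>) {
    // Build a hash table over the left side: key -> left row indices.
    let mut table: HashMap<i64, Vec<usize>> = HashMap::new();
    for (i, row) in left.iter().enumerate() {
        table.entry(row.id).or_default().push(i);
    }
    let (mut left_idx, mut right_idx) = (Vec::new(), Vec::new());
    // Probe with the right side and record every matching pair.
    for (j, row) in right.iter().enumerate() {
        if let Some(matches) = table.get(&row.id) {
            for &i in matches {
                left_idx.push(i);
                right_idx.push(j);
            }
        }
    }
    (left_idx, right_idx)
}

/// Step 2: keep only the index pairs whose rows pass the filter.
pub fn apply_join_filter<F>(
    left: &[LeftRow],
    right: &[RightRow],
    left_idx: &[usize],
    right_idx: &[usize],
    predicate: F,
) -> (Vec<usize>, Vec<usize>)
where
    F: Fn(i64, i64) -> bool,
{
    let (mut kept_left, mut kept_right) = (Vec::new(), Vec::new());
    for (&i, &j) in left_idx.iter().zip(right_idx) {
        // "Intermediate row": only the columns the filter reads.
        let (amount, limit) = (left[i].amount, right[j].limit);
        if predicate(amount, limit) {
            kept_left.push(i);
            kept_right.push(j);
        }
    }
    (kept_left, kept_right)
}
```

   Calling `apply_join_filter` with a predicate such as 
`|amount, limit| amount < limit` corresponds to a query of the form 
`... ON l.id = r.id AND l.amount < r.limit`.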
   
   The HashJoin physical plan contains an `Option<JoinFilter>`. The `JoinFilter` 
struct encapsulates all the data needed to create the intermediate batch and 
apply the filter: 
   - physical expression - the filter expression itself
   - column indices - the indices and join sides of the columns included in the 
intermediate batch
   - schema - the intermediate batch schema
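
   The three parts listed above could be modelled roughly as follows. This is 
a simplified stand-in, not the struct added by this PR: `JoinFilterSketch`, 
`JoinSide`, `ColumnIndex` and their fields are hypothetical, rows are plain 
`i64` slices, the expression is a boxed closure, and the schema is reduced to 
a list of column names.

```rust
/// Which join input a column of the intermediate batch comes from.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JoinSide { Left, Right }

/// Position of a source column plus the side it is read from.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnIndex { pub index: usize, pub side: JoinSide }

/// Simplified model of a join filter (hypothetical, for illustration).
pub struct JoinFilterSketch {
    /// Filter expression, evaluated against one intermediate row.
    pub expression: Box<dyn Fn(&[i64]) -> bool>,
    /// One entry per column of the intermediate row, in order.
    pub column_indices: Vec<ColumnIndex>,
    /// Column names of the intermediate row (stand-in for a schema).
    pub schema: Vec<String>,
}

impl JoinFilterSketch {
    /// Project only the columns the filter needs from both sides.
    pub fn intermediate_row(&self, left_row: &[i64], right_row: &[i64]) -> Vec<i64> {
        self.column_indices
            .iter()
            .map(|c| match c.side {
                JoinSide::Left => left_row[c.index],
                JoinSide::Right => right_row[c.index],
            })
            .collect()
    }

    /// Build the intermediate row and evaluate the expression on it.
    pub fn matches(&self, left_row: &[i64], right_row: &[i64]) -> bool {
        (self.expression)(&self.intermediate_row(left_row, right_row))
    }
}
```

   Keeping the column indices separate from the expression means the 
expression only ever refers to positions in the narrow intermediate row, not 
to positions in the full left or right input.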
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be 
updated before approving the PR.
   -->
   
   The plan builder and DF join methods now take an additional argument: an 
optional filter expression.
   
   <!--
   If there are any breaking changes to public APIs, please add the `api 
change` label.
   -->
   
   # Does this PR break compatibility with Ballista?
   
   Yes - new fields have been added to both the logical and physical plan join nodes.
   Related PR - https://github.com/apache/arrow-ballista/pull/36
   
   <!--
   The CI checks will attempt to build 
[arrow-ballista](https://github.com/apache/arrow-ballista) against this PR. If 
   this check fails then it indicates that this PR makes a breaking change to 
the DataFusion API.
   
   If possible, try to make the change in a way that is not a breaking API 
change. For example, if code has moved 
    around, try adding `pub use` from the original location to preserve the 
current API.
   
   If it is not possible to avoid a breaking change (such as when adding enum 
variants) then follow this process:
   
   - Make a corresponding PR against `arrow-ballista` with the changes required 
there
   - Update `dev/build-arrow-ballista.sh` to clone the appropriate 
`arrow-ballista` repo & branch
   - Merge this PR when CI passes
   - Merge the Ballista PR
   - Create a new PR here to reset `dev/build-arrow-ballista.sh` to point to 
`arrow-ballista` master again
   
   _If you would like to help improve this process, please see 
https://github.com/apache/arrow-datafusion/issues/2583_
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
