tobixdev opened a new issue, #17904:
URL: https://github.com/apache/datafusion/issues/17904

   ### Is your feature request related to a problem or challenge?
   
   If a subexpression in a join filter expression can be evaluated on a single 
side of an NL Join, we could push down this subexpression to the respective 
side. We do not do this for all joins because often evaluating the expression 
after any equi-joins is more performant. However,  the NL Join doesn't have a 
huge performance benefit from equi-joins and so it could benefit from such an 
approach.
   
   In theory, this should improve performance as the subexpression does not 
need to be evaluated for the cartesian product in the join. In practice the 
question is whether this optimization is worth its additional complexity. I 
think it is worth it for complex UDFs, but for simple ones not so much. 
Initially, we built this optimization in [RDF 
Fusion](https://github.com/tobixdev/rdf-fusion) and as I've now a bit more time 
on my hand I could try to upstream it. There we have some costly UDFs and see 
significant improvements for some queries (~60 %).
   
   Example: 
   
   Before:
   ```
   NestedLoopJoinExec: join_type=LeftSemi, filter=complex_udf(lhs.a) > rhs.b, 
projection=[rhs.b@1]
         <lhs>
        <rhs>
   ```
   
   After:
   ```
   NestedLoopJoinExec: join_type=LeftSemi, filter= join_proj_push_down_1 > 
rhs.b, projection=[rhs.b@1]
         ProjectionExec: expr=[a@0 as a, complex_udf(lhs.a) as 
join_proj_push_down_1]
              <lhs>
        <rhs>
   ```
   
   ### Describe the solution you'd like
   
   Discuss whether we want to include this optimization in DataFusion. 
   
   ### Describe alternatives you've considered
   
   Some alternative thoughts:
   
   - Just don't include the optimization
   - Include it, but not in the default Optimizer (Do we have something like 
that?)
   - We could also implement this as an optimization in the NL Join stream 
itself. However, this makes the join more complicated.
   - Implement it as logical plan optimization that checks whether the Join 
will become an NL join (duplicates planning logic)
   
   ### Additional context
   
   I'll create a draft PR that would upstream the optimization pass to 
DataFusion. The optimization changes the execution plan of two TPC-H queries, 
but I doubt that we will see much gains there.
   
   I am not so sure about the naming of this optimization as it could be mixed 
up with regular projection push down. Feel free to change it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to