alamb opened a new issue, #9056: URL: https://github.com/apache/arrow-datafusion/issues/9056
### Is your feature request related to a problem or challenge? TLDR is that it would be nice to more easily use DataFusion as an execution engine for engines like Spark (e.g. the comet project https://github.com/apache/arrow-datafusion-comet/pull/1), where operators directly take general expressions as join keys. As @viirya said on https://github.com/apache/arrow-datafusion/pull/8991: Currently the join keys of join operators like `SortMergeJoin` are restricted to be `Column`. But it is commonly we use expressions (e.g., `l_col + 1 = r_col + 2`) other than simply columns as join keys. From the query plan, DataFusion seems to add additional `Project` under join operator which projects the expressions into columns. So the above join operators take join keys as columns. However, in other query engines, e.g., Spark, its query plan doesn't have the additional projection but its join operators directly take general expressions as join keys. (note that by adding additional projection before join in Spark it means more data to be shuffled/sorted which can be bad for performance) That means if we cannot delegate such join operators to DataFusion physical join operators which require join keys must be columns. This patch tries to relax this join keys constraint of physical join operators. So we can construct DataFusion physical join operator using general expressions as join keys. This patch doesn't change how DataFusion plans the join operators. I.e., DataFusion still plans a join operation using non-column join keys into projection + join operator with columns. (We probably can remove this additional projection later if it also adds additional cost to DataFusion. Currently I'm not sure if/how DataFusion plans partitioning for the join operators.) ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered We could potentially still require joins to take only columns, and require other parts of the system to insert `ProjectionExec`s In fact it seems like It seems like substrait represents equi joins as columns (not expressions): https://substrait.io/relations/physical_relations/#hash-equijoin-properties It might be possible for engines like `comet` to insert `ProjectionExec` in the appropriate places in the plan using an optimizer pass For example, ``` HashJoinExec(exprs=(l_col + 1, r_col + 2)) ``` Could be rewritten to ``` HashJoinExec(exprs=(x, y)) ProjectionExec(exprs=[l_col + 1 as "x", r_col + 2 as "y"]) ``` ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
