liujiayi771 opened a new issue, #5350:
URL: https://github.com/apache/incubator-gluten/issues/5350

   ### Description
   
   Problems encountered while developing pull-out pre-project support for joins.
   * If we handle the join in the physical plan, there are two main problems.
     * (**Unresolved**) Handling `BroadcastHashJoin`'s `BroadcastExchange` is tricky. If the child of the pre-project is a `BroadcastExchange`, the pre-project needs to implement the `doExecuteBroadcast` method. If it eventually transforms into a `ProjectExecTransformer`, we can simply add `doExecuteBroadcast = child.doExecuteBroadcast` to `ProjectExecTransformer`. However, if `enableColumnarProject` is set to false, this project will ultimately fall back to a Spark `ProjectExec`, which does not have a `doExecuteBroadcast` method, making it impossible to insert the pre-`ProjectExec` as the parent of the `BroadcastExchange`.
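The delegation that works in the transformer case can be sketched abstractly (plain Scala with made-up stub names, not the real `SparkPlan` API):

```scala
// Abstract sketch: a columnar project transformer can forward
// doExecuteBroadcast to its child; a row-based ProjectExec stand-in
// would simply lack this method, so it cannot sit above the exchange.
trait BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] // stand-in for Broadcast[T]
}

class BroadcastExchangeStub(data: Seq[Int]) extends BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] = data
}

// Delegation, as ProjectExecTransformer could do
class ProjectTransformerStub(child: BroadcastCapable) extends BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] = child.doExecuteBroadcast()
}

val exchange = new BroadcastExchangeStub(Seq(1, 2, 3))
val project = new ProjectTransformerStub(exchange)
assert(project.doExecuteBroadcast() == Seq(1, 2, 3))
```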
     * (**Resolved**) The `SortOrder` obtained for `requiredOrders` in `SortMergeJoin` is based on the sort keys. However, the keys are rewritten during the pull-out process into attributes like `pre_0`/`pre_1` (where `pre_0` = `Alias(join key1 expression).toAttribute` and `pre_1` = `Alias(join key2 expression).toAttribute`). The `SortExec` inserted by Spark is based on the original keys, so the existing code believes the required order for `SortMergeJoin` is not satisfied and inserts an additional `SortExec` on `pre_0`/`pre_1`. This is unnecessary, because being sorted on `pre_0`/`pre_1` is the same as being sorted on the original join keys. The issue is similar to the earlier pull-out of `SortExec`, and the extra `SortExec` needs to be eliminated: Spark must understand that being sorted on the original join keys is equivalent to being sorted on `pre_0`/`pre_1`. As in that scenario, this can be accomplished by leveraging `sameOrderExpressions` in `SortOrder`.
   
   * If we handle the join in the logical plan, it is relatively straightforward, and there is no need to consider issues such as `BroadcastExchange` or `Sort`. Three main problems arise:
     * (**Can be resolved**) Spark has an optimization, `HashJoin.rewriteKeyExpr`, which rewrites multiple join keys into a single long value before the join. If this optimization kicks in, it necessitates a pre-project. This can be detected in the logical plan, and if enabled, the rewritten join key can be computed in advance and the condition replaced, with the pre-project containing the original output columns of both left and right as well as the rewritten column. Finally, a post-project can be used to revert to the original left and right outputs.
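The effect of the key rewrite, and of precomputing the rewritten column in a pre-project, can be sketched in plain Scala (the packing below only mimics the idea of `HashJoin.rewriteKeyExpr`; it is not Spark's actual implementation):

```scala
// Two int keys packed into a single Long, mimicking the idea of
// HashJoin.rewriteKeyExpr (illustrative, not Spark's code).
def packKeys(k1: Int, k2: Int): Long =
  (k1.toLong << 32) | (k2.toLong & 0xFFFFFFFFL)

val left = Seq((1, 10, "l1"), (2, 20, "l2"))
val right = Seq((1, 10, "r1"), (2, 99, "r2"))

// Pre-project: keep the original payload, add the rewritten key column
val leftPre = left.map { case (a, b, v) => (packKeys(a, b), v) }
val rightPre = right.map { case (a, b, v) => (packKeys(a, b), v) }

// The join condition now compares a single rewritten column
val joined = for {
  (lk, lv) <- leftPre
  (rk, rv) <- rightPre
  if lk == rk
} yield (lv, rv)

assert(joined == Seq(("l1", "r1")))
```

A post-project would then drop the packed column again to restore the original left/right outputs.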
     * (**Unresolved**) If the join condition contains `EqualNullSafe`, expressions such as `Coalesce` and `IsNull` are generated when it is converted to a physical plan. These need to be pulled out as well, but they cannot be pulled out in the logical plan.
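The semantics of that expansion for `a <=> b` can be mimicked with a `(Coalesce, IsNull)` key pair (plain-Scala sketch of the behavior, not Spark's code):

```scala
// A null-safe-equality key: (Coalesce(x, default), IsNull(x)).
// Comparing these pairs with plain equality reproduces EqualNullSafe:
// NULL <=> NULL is true, while NULL <=> 0 is false.
def nullSafeKey(x: Option[Int]): (Int, Boolean) =
  (x.getOrElse(0), x.isEmpty)

assert(nullSafeKey(None) == nullSafeKey(None))       // NULL <=> NULL
assert(nullSafeKey(None) != nullSafeKey(Some(0)))    // NULL <=> 0 is false
assert(nullSafeKey(Some(5)) == nullSafeKey(Some(5))) // 5 <=> 5
```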
     * (**Unresolved**) It affects DPP (dynamic partition pruning). In situations where subquery reuse was originally possible, the added pre-project changes the outputs, causing a mismatch and making reuse fail.
     
![image](https://github.com/apache/incubator-gluten/assets/13622031/6b0a0c91-2c54-4092-bdcb-883dcf5f655d)
     
![image](https://github.com/apache/incubator-gluten/assets/13622031/2b08a2d5-1e69-4c5a-8941-83bf9158d2ab)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

