liujiayi771 opened a new issue, #5350:
URL: https://github.com/apache/incubator-gluten/issues/5350

   ### Description
   
   Problems encountered while developing pull-out pre-project support for joins.
   * If we handle the join in the physical plan, there are two main problems.
     * (**Unresolved**) Handling `BroadcastHashJoin`'s `BroadcastExchange` is tricky. If the child of the pre-project is a `BroadcastExchange`, the pre-project needs to implement the `doExecuteBroadcast` method. If it eventually transforms into a `ProjectExecTransformer`, we can simply add `doExecuteBroadcast = child.doExecuteBroadcast` to `ProjectExecTransformer`. However, if `enableColumnarProject` is set to false, this project will ultimately fall back to a Spark `ProjectExec`, which does not have a `doExecuteBroadcast` method, making it impossible to insert the pre-`ProjectExec` as the parent of the `BroadcastExchange`.
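The delegation that works in the transformer case can be sketched abstractly (plain Scala with made-up stub names, not the real `SparkPlan` API):

```scala
// Abstract sketch: a columnar project transformer can forward
// doExecuteBroadcast to its child; a row-based ProjectExec stand-in
// would simply lack this method, so it cannot sit above the exchange.
trait BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] // stand-in for Broadcast[T]
}

class BroadcastExchangeStub(data: Seq[Int]) extends BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] = data
}

// Delegation, as ProjectExecTransformer could do
class ProjectTransformerStub(child: BroadcastCapable) extends BroadcastCapable {
  def doExecuteBroadcast(): Seq[Int] = child.doExecuteBroadcast()
}

val exchange = new BroadcastExchangeStub(Seq(1, 2, 3))
val project = new ProjectTransformerStub(exchange)
assert(project.doExecuteBroadcast() == Seq(1, 2, 3))
```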
     * (**Resolved**) The `SortOrder` obtained for `requiredOrders` in `SortMergeJoin` is based on the sort keys. However, the keys are rewritten during the pull-out process into attributes like `pre_0`/`pre_1` (where `pre_0` = `Alias(join key1 expression).toAttribute` and `pre_1` = `Alias(join key2 expression).toAttribute`). The `SortExec` inserted by Spark is based on the original keys, so the existing code believes the required order for `SortMergeJoin` is not satisfied and inserts an additional `SortExec` on `pre_0`/`pre_1`. This is unnecessary, because being sorted on `pre_0`/`pre_1` is the same as being sorted on the original join keys. The issue is similar to the earlier pull-out of `SortExec`, and the extra `SortExec` needs to be eliminated: Spark must understand that being sorted on the original join keys is equivalent to being sorted on `pre_0`/`pre_1`. As in that scenario, this can be accomplished by leveraging `sameOrderExpressions` in `SortOrder`.
   
   * If we handle the join in the logical plan, it is relatively straightforward, and there is no need to consider issues such as `BroadcastExchange` or `Sort`. Three main problems arise:
     * (**Can be resolved**) Spark has an optimization, `HashJoin.rewriteKeyExpr`, which rewrites multiple join keys into a single long value before the join. If this optimization kicks in, it necessitates a pre-project. This can be detected in the logical plan, and if enabled, the rewritten join key can be computed in advance and the condition replaced, with the pre-project containing the original output columns of both left and right as well as the rewritten column. Finally, a post-project can be used to revert to the original left and right outputs.
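The effect of the key rewrite, and of precomputing the rewritten column in a pre-project, can be sketched in plain Scala (the packing below only mimics the idea of `HashJoin.rewriteKeyExpr`; it is not Spark's actual implementation):

```scala
// Two int keys packed into a single Long, mimicking the idea of
// HashJoin.rewriteKeyExpr (illustrative, not Spark's code).
def packKeys(k1: Int, k2: Int): Long =
  (k1.toLong << 32) | (k2.toLong & 0xFFFFFFFFL)

val left = Seq((1, 10, "l1"), (2, 20, "l2"))
val right = Seq((1, 10, "r1"), (2, 99, "r2"))

// Pre-project: keep the original payload, add the rewritten key column
val leftPre = left.map { case (a, b, v) => (packKeys(a, b), v) }
val rightPre = right.map { case (a, b, v) => (packKeys(a, b), v) }

// The join condition now compares a single rewritten column
val joined = for {
  (lk, lv) <- leftPre
  (rk, rv) <- rightPre
  if lk == rk
} yield (lv, rv)

assert(joined == Seq(("l1", "r1")))
```

A post-project would then drop the packed column again to restore the original left/right outputs.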
     * (**Unresolved**) If the join condition contains `EqualNullSafe`, expressions such as `Coalesce` and `IsNull` are generated when it is converted to a physical plan. These need to be pulled out as well, but they cannot be pulled out in the logical plan.
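The semantics of that expansion for `a <=> b` can be mimicked with a `(Coalesce, IsNull)` key pair (plain-Scala sketch of the behavior, not Spark's code):

```scala
// A null-safe-equality key: (Coalesce(x, default), IsNull(x)).
// Comparing these pairs with plain equality reproduces EqualNullSafe:
// NULL <=> NULL is true, while NULL <=> 0 is false.
def nullSafeKey(x: Option[Int]): (Int, Boolean) =
  (x.getOrElse(0), x.isEmpty)

assert(nullSafeKey(None) == nullSafeKey(None))       // NULL <=> NULL
assert(nullSafeKey(None) != nullSafeKey(Some(0)))    // NULL <=> 0 is false
assert(nullSafeKey(Some(5)) == nullSafeKey(Some(5))) // 5 <=> 5
```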
     * (**Unresolved**) It affects DPP (dynamic partition pruning). In situations where subquery reuse was originally possible, the added pre-project changes the outputs, causing a mismatch and making reuse fail.
     
![image](https://github.com/apache/incubator-gluten/assets/13622031/6b0a0c91-2c54-4092-bdcb-883dcf5f655d)
     
![image](https://github.com/apache/incubator-gluten/assets/13622031/2b08a2d5-1e69-4c5a-8941-83bf9158d2ab)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

