askoa commented on issue #4795:
URL:
https://github.com/apache/arrow-datafusion/issues/4795#issuecomment-1380647966
I found part of the problem. The rule `common_sub_expression_eliminate`
swallows one of the `EXIST (<subquery>)`. Below are detailed steps
logical plan before `decorrelate_where_exists` -> `... EXISTS(<subquery>)
AND (EXISTS(<subquery>) OR EXISTS(<subquery>)) .... subquery: ... subquery: ...
subquery: ...`
the rule `decorrelate_where_exists` mutates the first `EXIST` into a
`LeftSemi Join`
logical plan after `decorrelate_where_exists ` and before
`common_sub_expression_eliminate` -> `...(EXISTS(<subquery>) OR
EXISTS(<subquery>)) ... subquery: ... subquery: ... LeftSemi Join: ...`
the rule `common_sub_expression_eliminate` runs:
- The rule maintains a HashMap of expressions to no. of occurrences. The
expression `EXISTS(<subquery>)` is recognized as a single expression occurring
twice.
https://github.com/apache/arrow-datafusion/blob/1844d39eb92f04e483095f491ff07da3a2f67f25/datafusion/optimizer/src/common_subexpr_eliminate.rs#L451-L454
- The rule creates a projection that merges the two similar expressions
and thus making the second subquery irrelevant.
logical plan after `common_sub_expression_eliminate`:
`...(EXISTS(<subquery>) OR EXISTS(<subquery>)) ... Projection:
EXISTS<subquery>... subquery: ... LeftSemi Join: ...`
As I am new to both sql optimization and to data fusion I would like to use
this thread to discuss potential solutions. I assume adding unique identifiers
or aliases to subqueries might be part of the solution. Adding unique
identifiers or aliases will stop `common_sub_expression_elimitate` treating
both subqueries as the same.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]