[ 
https://issues.apache.org/jira/browse/SPARK-41416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41416:
------------------------------------

    Assignee:     (was: Apache Spark)

> Rewrite self join in in predicate to aggregate
> ----------------------------------------------
>
>                 Key: SPARK-41416
>                 URL: https://issues.apache.org/jira/browse/SPARK-41416
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Wan Kun
>            Priority: Major
>
> Transforms the SelfJoin resulting in duplicate rows used for IN predicate to 
> aggregation.
> For IN predicate, duplicate rows does not have any value. It will be overhead.
> Ex: TPCDS Q95: following CTE is used only in IN predicates for only one 
> column comparison ({@code ws_order_number}).
> This results in exponential increase in Joined rows with too many duplicate 
> rows.
> {code:java}
> WITH ws_wh AS
> (
>        SELECT ws1.ws_order_number,
>               ws1.ws_warehouse_sk wh1,
>               ws2.ws_warehouse_sk wh2
>        FROM   web_sales ws1,
>               web_sales ws2
>        WHERE  ws1.ws_order_number = ws2.ws_order_number
>        AND    ws1.ws_warehouse_sk <> ws2.ws_warehouse_sk)
> {code}
> Could be optimized as below:
> {code:java}
> WITH ws_wh AS
>     (SELECT ws_order_number
>       FROM  web_sales
>       GROUP BY ws_order_number
>       HAVING COUNT(DISTINCT ws_warehouse_sk) > 1)
> {code}
> Optimized CTE scans table only once and results in unique rows.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to