[PR] [SPARK-43025][SQL] Eliminate Union if filters have the same child plan [spark]

via GitHub Tue, 16 Apr 2024 23:46:56 -0700


beliefer opened a new pull request, #40661:
URL: https://github.com/apache/spark/pull/40661


   ### What changes were proposed in this pull request?
   There are a lot of SQL with union multiple subquery with filter in user 
scenarios. Take an example,
   **q1**
   ```
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
        LEFT JOIN store_returns
        ON sr_item_sk = ss_item_sk AND sr_ticket_number = ss_ticket_number
   WHERE sr_return_amt > 10000
   UNION ALL
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
        LEFT JOIN store_returns
        ON sr_item_sk = ss_item_sk AND sr_ticket_number = ss_ticket_number
   WHERE sr_return_amt < 1000
   ```
   In fact, we can simplify this SQL as
   ```
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
        LEFT JOIN store_returns
        ON sr_item_sk = ss_item_sk
                AND sr_ticket_number = ss_ticket_number
   WHERE sr_return_amt > 10000 OR sr_return_amt < 1000
   ```
   **q2**
   ```
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
   WHERE ss_ext_discount_amt > 1000
   UNION ALL
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
   WHERE ss_ext_discount_amt < 100
   ```
   In fact, we can simplify this SQL as
   ```
   SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
   FROM store_sales
   WHERE ss_ext_discount_amt > 1000 OR ss_ext_discount_amt < 100
   ```
   
   This PR optimizes `Union` operators if the children exists at least two 
`Filter` by:
   1. Eliminate `Union` operators if all the children are `Filter` and all the 
child of these `Filter`s are same. We just need merging the predicates into one 
single predicate by connecting these `Filter`s with `Or`.
   2. Combines multiple `Filter` operators into one if all the child of these 
`Filter`s are same. We just need merging the predicates into one single 
predicate by connecting these `Filter`s with `Or` too.
   
   ### Why are the changes needed?
   Simply the SQL plan and improve the performance.
   
   
   ### Does this PR introduce _any_ user-facing change?
   'No'.
   New feature and just update the inner implementation.
   
   
   ### How was this patch tested?
   New test cases.
   The micro benchmark for q1 and q2.
   Before this PR
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   q1                                               51569          52030        
 653          0.6        1627.9       1.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   q2                                                4255           4287        
  45          6.8         147.7       1.0X
   ```
   After this PR
   ```
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   q1.                                              46806          47462        
 929          0.7        1477.5       1.0X
   
   Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
   Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
   TPCDS Snappy:                             Best Time(ms)   Avg Time(ms)   
Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
   
------------------------------------------------------------------------------------------------------------------------
   q2.                                                2655           2674       
   28         10.8          92.2       1.0X
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-43025][SQL] Eliminate Union if filters have the same child plan [spark]

Reply via email to