beliefer opened a new pull request, #40661:
URL: https://github.com/apache/spark/pull/40661
### What changes were proposed in this pull request?
There are a lot of SQL with union multiple subquery with filter in user
scenarios. Take an example,
**q1**
```
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
LEFT JOIN store_returns
ON sr_item_sk = ss_item_sk AND sr_ticket_number = ss_ticket_number
WHERE sr_return_amt > 10000
UNION ALL
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
LEFT JOIN store_returns
ON sr_item_sk = ss_item_sk AND sr_ticket_number = ss_ticket_number
WHERE sr_return_amt < 1000
```
In fact, we can simplify this SQL as
```
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
LEFT JOIN store_returns
ON sr_item_sk = ss_item_sk
AND sr_ticket_number = ss_ticket_number
WHERE sr_return_amt > 10000 OR sr_return_amt < 1000
```
**q2**
```
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
WHERE ss_ext_discount_amt > 1000
UNION ALL
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
WHERE ss_ext_discount_amt < 100
```
In fact, we can simplify this SQL as
```
SELECT ss_item_sk, ss_ticket_number, ss_customer_sk
FROM store_sales
WHERE ss_ext_discount_amt > 1000 OR ss_ext_discount_amt < 100
```
This PR optimizes `Union` operators if the children exists at least two
`Filter` by:
1. Eliminate `Union` operators if all the children are `Filter` and all the
child of these `Filter`s are same. We just need merging the predicates into one
single predicate by connecting these `Filter`s with `Or`.
2. Combines multiple `Filter` operators into one if all the child of these
`Filter`s are same. We just need merging the predicates into one single
predicate by connecting these `Filter`s with `Or` too.
### Why are the changes needed?
Simply the SQL plan and improve the performance.
### Does this PR introduce _any_ user-facing change?
'No'.
New feature and just update the inner implementation.
### How was this patch tested?
New test cases.
The micro benchmark for q1 and q2.
Before this PR
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
TPCDS Snappy: Best Time(ms) Avg Time(ms)
Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
q1 51569 52030
653 0.6 1627.9 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
TPCDS Snappy: Best Time(ms) Avg Time(ms)
Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
q2 4255 4287
45 6.8 147.7 1.0X
```
After this PR
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
TPCDS Snappy: Best Time(ms) Avg Time(ms)
Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
q1. 46806 47462
929 0.7 1477.5 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_311-b11 on Mac OS X 10.16
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz
TPCDS Snappy: Best Time(ms) Avg Time(ms)
Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
q2. 2655 2674
28 10.8 92.2 1.0X
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]