my-ship-it opened a new issue, #659:
URL: https://github.com/apache/cloudberry/issues/659
### Cloudberry Database version
main
### What happened
Optimizer could produce better plan to dedup based on cost.
### What you think should happen instead
_No response_
### How to reproduce
For the planner, we need to create better plan which uses HashAggregate to
dedup columns.
For example,
```
ebi_dwh=# explain analyze WITH UniqueIDs AS (
SELECT DISTINCT gsp_id
FROM eft_com_call_statistic
)
SELECT COUNT(*)
FROM UniqueIDs;
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=0.00..1419.47 rows=1 width=8) (actual
time=2510.079..2510.081 rows=1 loops=1)
-> Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..1419.47 rows=1
width=8) (actual time=1998.786..2510.054 rows=4 loops=1)
-> Partial Aggregate (cost=0.00..1419.47 rows=1 width=8) (actual
time=1993.802..1993.803 rows=1 loops=1)
-> HashAggregate (cost=0.00..1419.47 rows=2315005 width=1)
(actual time=1294.980..1855.013 rows=2339996 loops=1)
Group Key: gsp_id
-> Seq Scan on eft_com_call_statistic
(cost=0.00..969.33 rows=2375680 width=37) (actual time=0.343..339.551
rows=2377251 loops=1)
Planning Time: 5.357 ms
(slice0) Executor memory: 98354K bytes.
(slice1) Executor memory: 244665K bytes avg x 4x(0) workers, 244755K
bytes max (seg2). Work_mem: 286737K bytes max.
Memory used: 2097152kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution Time: 2584.618 ms
(12 rows)
```
This plan uses HashAggregate to perform gsp_id deduplication, while
```
ebi_dwh=# explain analyze select count(distinct gsp_id) from
eft_com_call_statistic;
QUERY
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate (cost=0.00..1142.67 rows=1 width=8) (actual
time=10573.416..10573.418 rows=1 loops=1)
-> Gather Motion 4:1 (slice1; segments: 4) (cost=0.00..1142.67 rows=1
width=8) (actual time=10205.645..10573.387 rows=4 loops=1)
-> Partial Aggregate (cost=0.00..1142.67 rows=1 width=8) (actual
time=10437.005..10437.007 rows=1 loops=1)
-> Seq Scan on eft_com_call_statistic (cost=0.00..969.33
rows=2375680 width=37) (actual time=0.269..548.673 rows=2377251 loops=1)
Planning Time: 4.608 ms
(slice0) Executor memory: 37K bytes.
(slice1) Executor memory: 148480K bytes avg x 4x(0) workers, 148579K
bytes max (seg2).
Memory used: 2097152kB
Optimizer: Pivotal Optimizer (GPORCA)
Execution Time: 10573.853 ms
(10 rows)
```
It seems the cost estimation is no accurate.
### Operating System
No specific
### Anything else
No
### Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://cloudberrydb.org/community/coc).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]