my-ship-it opened a new issue, #659:
URL: https://github.com/apache/cloudberry/issues/659

   ### Cloudberry Database version
   
   main
   
   ### What happened
   
   Optimizer could produce better plan to dedup based on cost.
   
   ### What you think should happen instead
   
   _No response_
   
   ### How to reproduce
   
   For the planner, we need to create better plan which uses HashAggregate to 
dedup columns.
   For example,
   ```
   ebi_dwh=# explain analyze WITH UniqueIDs AS (
       SELECT DISTINCT gsp_id
       FROM eft_com_call_statistic
   )
   SELECT COUNT(*)
   FROM UniqueIDs;
                                                                          QUERY 
PLAN
   
---------------------------------------------------------------------------------------------------------------------------------------------------------
    Finalize Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual 
time=2510.079..2510.081 rows=1 loops=1)
      ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1419.47 rows=1 
width=8) (actual time=1998.786..2510.054 rows=4 loops=1)
            ->  Partial Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual 
time=1993.802..1993.803 rows=1 loops=1)
                  ->  HashAggregate  (cost=0.00..1419.47 rows=2315005 width=1) 
(actual time=1294.980..1855.013 rows=2339996 loops=1)
                        Group Key: gsp_id
                        ->  Seq Scan on eft_com_call_statistic  
(cost=0.00..969.33 rows=2375680 width=37) (actual time=0.343..339.551 
rows=2377251 loops=1)
    Planning Time: 5.357 ms
      (slice0)    Executor memory: 98354K bytes.
      (slice1)    Executor memory: 244665K bytes avg x 4x(0) workers, 244755K 
bytes max (seg2).  Work_mem: 286737K bytes max.
    Memory used:  2097152kB
    Optimizer: Pivotal Optimizer (GPORCA)
    Execution Time: 2584.618 ms
   (12 rows)
   
   ```
   
   This plan uses HashAggregate to perform gsp_id deduplication, while
   
   ```
    ebi_dwh=# explain analyze select count(distinct gsp_id) from 
eft_com_call_statistic;
                                                                       QUERY 
PLAN
   
---------------------------------------------------------------------------------------------------------------------------------------------------
    Finalize Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual 
time=10573.416..10573.418 rows=1 loops=1)
      ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1142.67 rows=1 
width=8) (actual time=10205.645..10573.387 rows=4 loops=1)
            ->  Partial Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual 
time=10437.005..10437.007 rows=1 loops=1)
                  ->  Seq Scan on eft_com_call_statistic  (cost=0.00..969.33 
rows=2375680 width=37) (actual time=0.269..548.673 rows=2377251 loops=1)
    Planning Time: 4.608 ms
      (slice0)    Executor memory: 37K bytes.
      (slice1)    Executor memory: 148480K bytes avg x 4x(0) workers, 148579K 
bytes max (seg2).
    Memory used:  2097152kB
    Optimizer: Pivotal Optimizer (GPORCA)
    Execution Time: 10573.853 ms
   (10 rows)
   ```
   
   It seems the cost estimation is no accurate.
   
   ### Operating System
   
   No specific
   
   ### Anything else
   
   No
   
   ### Are you willing to submit PR?
   
   - [ ] Yes, I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://cloudberrydb.org/community/coc).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to