On Mon, Nov 6, 2017 at 1:16 AM, David Rowley <david.row...@2ndquadrant.com> wrote:
> In [1] we made a change to process the GROUP BY clause to remove any > group by items that are functionally dependent on some other GROUP BY > items. > > This really just checks if a table's PK columns are entirely present > in the GROUP BY clause and removes anything else belonging to that > table. > > All this seems to work well, but I totally failed to consider that the > exact same thing applies to DISTINCT too. > > Over in [2], Rui Liu mentions that the planner could do a better job > for his case. > > Using Rui Liu's example: > > CREATE TABLE test_tbl ( k INT PRIMARY KEY, col text); > INSERT into test_tbl select generate_series(1,10000000), 'test'; > > Master: > > postgres=# explain analyze verbose select distinct col, k from > test_tbl order by k limit 1000; > QUERY > PLAN > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------------- > Limit (cost=1658556.19..1658563.69 rows=1000 width=9) (actual > time=8934.962..8935.495 rows=1000 loops=1) > Output: col, k > -> Unique (cost=1658556.19..1733557.50 rows=10000175 width=9) > (actual time=8934.961..8935.460 rows=1000 loops=1) > Output: col, k > -> Sort (cost=1658556.19..1683556.63 rows=10000175 width=9) > (actual time=8934.959..8935.149 rows=1000 loops=1) > Output: col, k > Sort Key: test_tbl.k, test_tbl.col > Sort Method: external merge Disk: 215128kB > -> Seq Scan on public.test_tbl (cost=0.00..154056.75 > rows=10000175 width=9) (actual time=0.062..1901.728 rows=10000000 > loops=1) > Output: col, k > Planning time: 0.092 ms > Execution time: 8958.687 ms > (12 rows) > > Patched: > > postgres=# explain analyze verbose select distinct col, k from > test_tbl order by k limit 1000; > > QUERY PLAN > ------------------------------------------------------------ > ------------------------------------------------------------ > ---------------------------------- > Limit (cost=0.44..34.31 rows=1000 width=9) (actual time=0.030..0.895 > rows=1000 loops=1) > Output: col, k > -> Unique (cost=0.44..338745.50 rows=10000175 width=9) (actual > time=0.029..0.814 rows=1000 loops=1) > Output: col, k > -> Index Scan using test_tbl_pkey on public.test_tbl > (cost=0.44..313745.06 rows=10000175 width=9) (actual time=0.026..0.452 > rows=1000 loops=1) > Output: col, k > Planning time: 0.152 ms > Execution time: 0.985 ms > (8 rows) > > A patch to implement this is attached. > > Couldn't the Unique node be removed entirely? If k is a primary key, you can't have duplicates in need of removal. Or would that be a subject for a different patch? I think remove_functionally_dependant_groupclauses should have a more generic name, like remove_functionally_dependant_clauses. Cheers, Jeff