Hi Pierre, On Fri, Jul 31, 2020 at 2:11 PM Pierre Ducroquet <p.p...@pinaraf.info> wrote: > > Hi > > In a recent audit, I noticed that application developers have a tendency to > abuse the distinct clause. For instance they use an ORM and add a distinct at > the top level just because they don't know the cost it has, or they don't know > that using EXISTS is a better way to express their queries than doing JOINs > (or worse, they can't do better). > > They thus have this kind of queries (considering tbl1 has a PK of course): > SELECT DISTINCT * FROM tbl1; > SELECT DISTINCT * FROM tbl1 ORDER BY a; > SELECT DISTINCT tbl1.* FROM tbl1 > JOIN tbl2 ON tbl2.a = tbl1.id; > > These can be transformed into: > SELECT * FROM tbl1 ORDER BY *;
We don't need an ORDER BY here since there's primary key on tbl1 and DISTINCT doesn't ensure ordered result. > SELECT * FROM tbl1 ORDER BY a; > SELECT tbl1.* FROM tbl1 SEMI-JOIN tbl2 ON tbl2.a = tbl1.id ORDER BY tbl1.*; > > The attached patch does that. > I added extra safeties in several place just to be sure I don't touch > something I can not handle, but I may have been very candid with the distinct > to sort transformation. > The cost of this optimization is quite low : for queries that don't have any > distinct, it's just one if. If there is a distinct, we iterate once through > every target, then we fetch the PK and iterate through the DISTINCT clause > fields. If it is possible to optimize, we then iterate through the JOINs. We are already discussing this feature at https://www.postgresql.org/message-id/flat/CAKJS1f-wH83Fi2coEVNUWFxOGQ4BJRRTGqDMvidCoiR9WEwxsw%40mail.gmail.com#56a08b441cc61afaf85c6232c5d40a3f. You are welcome to contribute your ideas/code/review on that thread. -- Best Wishes, Ashutosh Bapat