Hi Ashutosh: Thanks for your time. On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com> wrote:
> Hi Andy, > What might help is to add more description to your email message like > giving examples to explain your idea. > > Anyway, I looked at the testcases you added for examples. > +create table select_distinct_a(a int, b char(20), c char(20) not null, > d int, e int, primary key(a, b)); > +set enable_mergejoin to off; > +set enable_hashjoin to off; > +-- no node for distinct. > +explain (costs off) select distinct * from select_distinct_a; > + QUERY PLAN > +------------------------------- > + Seq Scan on select_distinct_a > +(1 row) > > From this example, it seems that the distinct operation can be dropped > because (a, b) is a primary key. Is my understanding correct? > Yes, you are correct. Actually I added then to commit message, but it's true that I should have copied them in this email body as well. so copy it now. [PATCH] Erase the distinctClause if the result is unique by definition For a single relation, we can tell it by any one of the following is true: 1. The pk is in the target list. 2. The uk is in the target list and the columns is not null 3. The columns in group-by clause is also in the target list for relation join, we can tell it by: if every relation in the jointree yields a unique result set, then the final result is unique as well regardless the join method. for semi/anti join, we will ignore the righttable. I like the idea since it eliminates one expensive operation. > > However the patch as presented has some problems > 1. What happens if the primary key constraint or NOT NULL constraint gets > dropped between a prepare and execute? The plan will no more be valid and > thus execution may produce non-distinct results. > Will this still be an issue if user use doesn't use a "read uncommitted" isolation level? I suppose it should be ok for this case. But even though I should add an isolation level check for this. Just added that in the patch to continue discussing of this issue. > PostgreSQL has similar concept of allowing non-grouping expression as part > of targetlist when those expressions can be proved to be functionally > dependent on the GROUP BY clause. See check_functional_grouping() and its > caller. I think, DISTINCT elimination should work on similar lines. > 2. For the same reason described in check_functional_grouping(), using > unique indexes for eliminating DISTINCT should be discouraged. > I checked the comments of check_functional_grouping, the reason is * Currently we only check to see if the rel has a primary key that is a * subset of the grouping_columns. We could also use plain unique constraints * if all their columns are known not null, but there's a problem: we need * to be able to represent the not-null-ness as part of the constraints added * to *constraintDeps. FIXME whenever not-null constraints get represented * in pg_constraint. Actually I am doubtful the reason for pg_constraint since we still be able to get the not null information from relation->rd_attr->attrs[n].attnotnull which is just what this patch did. 3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY > as well > This is a good point. The rules may have some different for join, so I prefer to to focus on the current one so far. > 4. The patch works only at the query level, but that functionality can be > expanded generally to other places which add Unique/HashAggregate/Group > nodes if the underlying relation can be proved to produce distinct rows. > But that's probably more work since we will have to label paths with unique > keys similar to pathkeys. > Do you mean adding some information into PlannerInfo, and when we create a node for Unique/HashAggregate/Group, we can just create a dummy node? > 5. Have you tested this OUTER joins, which can render inner side nullable? > Yes, that part was missed in the test case. I just added them. On Thu, Feb 6, 2020 at 11:31 AM Andy Fan <zhihui.fan1...@gmail.com> wrote: > >> update the patch with considering the semi/anti join. >> >> Can anyone help to review this patch? >> >> Thanks >> >> >> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan <zhihui.fan1...@gmail.com> >> wrote: >> >>> Hi: >>> >>> I wrote a patch to erase the distinctClause if the result is unique by >>> definition, I find this because a user switch this code from oracle >>> to PG and find the performance is bad due to this, so I adapt pg for >>> this as well. >>> >>> This patch doesn't work for a well-written SQL, but some drawback >>> of a SQL may be not very obvious, since the cost of checking is pretty >>> low as well, so I think it would be ok to add.. >>> >>> Please see the patch for details. >>> >>> Thank you. >>> >> > > -- > -- > Best Wishes, > Ashutosh Bapat >
0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data