Re: [PATCH] Erase the distinctClause if the result is unique by definition

Andy Fan Fri, 07 Feb 2020 23:23:55 -0800

Hi Ashutosh:
   Thanks for your time.

On Fri, Feb 7, 2020 at 11:54 PM Ashutosh Bapat <ashutosh.bapat....@gmail.com>
wrote:

> Hi Andy,
> What might help is to add more description to your email message like
> giving examples to explain your idea.
>
> Anyway, I looked at the testcases you added for examples.
> +create table select_distinct_a(a int, b char(20),  c char(20) not null,
>  d int, e int, primary key(a, b));
> +set enable_mergejoin to off;
> +set enable_hashjoin to off;
> +-- no node for distinct.
> +explain (costs off) select distinct * from select_distinct_a;
> +          QUERY PLAN
> +-------------------------------
> + Seq Scan on select_distinct_a
> +(1 row)
>
> From this example, it seems that the distinct operation can be dropped
> because (a, b) is a primary key. Is my understanding correct?
>

Yes, you are correct.   Actually I added then to commit message,
but it's true that I should have copied them in this email body as
 well.  so copy it now.

[PATCH] Erase the distinctClause if the result is unique by
 definition

For a single relation, we can tell it by any one of the following
is true:
1. The pk is in the target list.
2. The uk is in the target list and the columns is not null
3. The columns in group-by clause is also in the target list

for relation join, we can tell it by:
if every relation in the jointree yields a unique result set, then
the final result is unique as well regardless the join method.
for semi/anti join, we will ignore the righttable.

I like the idea since it eliminates one expensive operation.
>
> However the patch as presented has some problems
> 1. What happens if the primary key constraint or NOT NULL constraint gets
> dropped between a prepare and execute? The plan will no more be valid and
> thus execution may produce non-distinct results.
>

Will this still be an issue if user use doesn't use a "read uncommitted"
isolation level?  I suppose it should be ok for this case.  But even though
I should add an isolation level check for this.  Just added that in the
patch
to continue discussing of this issue.

> PostgreSQL has similar concept of allowing non-grouping expression as part
> of targetlist when those expressions can be proved to be functionally
> dependent on the GROUP BY clause. See check_functional_grouping() and its
> caller. I think, DISTINCT elimination should work on similar lines.
>
2. For the same reason described in check_functional_grouping(), using
> unique indexes for eliminating DISTINCT should be discouraged.
>

I checked the comments of check_functional_grouping,  the reason is

 * Currently we only check to see if the rel has a primary key that is a
 * subset of the grouping_columns.  We could also use plain unique
constraints
 * if all their columns are known not null, but there's a problem: we need
 * to be able to represent the not-null-ness as part of the constraints
added
 * to *constraintDeps.  FIXME whenever not-null constraints get represented
 * in pg_constraint.

Actually I am doubtful the reason for pg_constraint since we still be able
to get the not null information from relation->rd_attr->attrs[n].attnotnull
which
is just what this patch did.

3. If you could eliminate DISTINCT you could similarly eliminate GROUP BY
> as well
>

This is a good point.   The rules may have some different for join,  so I
prefer
to to focus on the current one so far.

> 4. The patch works only at the query level, but that functionality can be
> expanded generally to other places which add Unique/HashAggregate/Group
> nodes if the underlying relation can be proved to produce distinct rows.
> But that's probably more work since we will have to label paths with unique
> keys similar to pathkeys.
>

Do you mean adding some information into PlannerInfo,  and when we create
a node for Unique/HashAggregate/Group,  we can just create a dummy node?

> 5. Have you tested this OUTER joins, which can render inner side nullable?
>

Yes, that part was missed in the test case.  I just added them.

On Thu, Feb 6, 2020 at 11:31 AM Andy Fan <zhihui.fan1...@gmail.com> wrote:
>
>> update the patch with considering the semi/anti join.
>>
>> Can anyone help to review this patch?
>>
>> Thanks
>>
>>
>> On Fri, Jan 31, 2020 at 8:39 PM Andy Fan <zhihui.fan1...@gmail.com>
>> wrote:
>>
>>> Hi:
>>>
>>> I wrote a patch to erase the distinctClause if the result is unique by
>>> definition,  I find this because a user switch this code from oracle
>>> to PG and find the performance is bad due to this,  so I adapt pg for
>>> this as well.
>>>
>>> This patch doesn't work for a well-written SQL,  but some drawback
>>> of a SQL may be not very obvious,  since the cost of checking is pretty
>>> low as well,  so I think it would be ok to add..
>>>
>>> Please see the patch for details.
>>>
>>> Thank you.
>>>
>>
>
> --
> --
> Best Wishes,
> Ashutosh Bapat
>

0001-Erase-the-distinctClause-if-the-result-is-unique-by-.patch
Description: Binary data

Re: [PATCH] Erase the distinctClause if the result is unique by definition

Reply via email to