Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley <dgrowle...@gmail.com> wrote:
> On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> > Now I am convinced that we should maintain UniquePath on RelOptInfo,
> > I would see how to work with "Index Skip Scan" patch.
>
> I've attached a very early proof of concept patch for unique keys.
> The NULL detection stuff is not yet hooked up, so it'll currently do
> the wrong thing for NULLable columns. I've left some code in there
> with my current idea of how to handle that, but I'll need to add more
> code both to look at the catalogue tables to see if there's a NOT NULL
> constraint and also to check for strict quals that filter out NULLs.
>
> Additionally, I've not hooked up the collation checking stuff yet. I
> just wanted to see if it would work ok for non-collatable types first.
>
> I've added a couple of lines to create_distinct_paths() to check if
> the input_rel has the required UniqueKeys to skip doing the DISTINCT.
> It seems to work, but my tests so far are limited to:
>
> create table t1(a int primary key, b int);
> create table t2(a int primary key, b int);
>
> postgres=# -- t2 could duplicate t1, don't remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.b;
>             QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t1.a
>    ->  Hash Join
>          Hash Cond: (t2.b = t1.a)
>          ->  Seq Scan on t2
>          ->  Hash
>                ->  Seq Scan on t1
> (7 rows)
>
> postgres=# -- neither rel can duplicate the other due to join on PK.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.a;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t2.a cannot duplicate t1 and t1.a is unique.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.b = t2.a;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t1.b = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.b = t2.a;
>             QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t2.a
>    ->  Hash Join
>          Hash Cond: (t1.b = t2.a)
>          ->  Seq Scan on t1
>          ->  Hash
>                ->  Seq Scan on t2
> (7 rows)
>
> postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.a = t2.b;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t2.b = t1.a)
>    ->  Seq Scan on t2
>    ->  Hash
>          ->  Seq Scan on t1
> (5 rows)
>
> I've also left a bunch of XXX comments for things that I know need more
> thought.
>
> I believe we can propagate the joinrel's unique keys where the patch
> is currently doing it. I understand that in
> populate_joinrel_with_paths() we do things like swapping LEFT JOINs
> for RIGHT JOINs and switch the input rels around, but we do so only
> because it's equivalent, so I don't currently see why we can't take
> the jointype for the SpecialJoinInfo. I need to know that as I'll need
> to ignore pushed down RestrictInfos for outer joins.
>
> I'm posting now as I know I've been mentioning this UniqueKeys idea
> for quite a while and if it's not something that's going to get off
> the ground, then it's better to figure that out now.

Thanks for the code! Here are some points from me.

1. For populate_baserel_uniquekeys, we need to handle the "pk = Const"
case as well (relation_has_unique_index_for has similar logic). Currently
the following distinct path is still there.
postgres=# explain select distinct b from t100 where pk = 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case we can add both (pk) and (b) as UniqueKeys. If so,
we get more opportunities to reach our goal.

2. As for propagate_unique_keys_to_joinrel, we can add one more UniqueKey,
(rel1_unique_keys, rel2_unique_keys), when the current rules don't apply;
otherwise the following case can't be handled.

postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
         Sort Key: t100.pk, t101.pk
         ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
               ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
               ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
                     ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

But if we add such a rule, the unique key list probably becomes much
longer, so we need a strategy to tell whether a UniqueKey is useful for
the query; if it is not, we can ignore it. rel->reltarget may be good
information for such an optimization. I think we can take this into
consideration for populate_baserel_uniquekeys as well.

For the non-null info, Tom suggested maintaining such info in RelOptInfo.
I have done that for the not-null info from the basic relation catalog; I
think we can maintain the same flag for joinrels, and also use the
not-null info from find_nonnullable_vars, but I still haven't found a good
place to add that so far.
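The rule in point 2 can be checked outside the planner: if each side of a
join has a unique key, concatenating the two keys yields a unique key over
the cross join of the two inputs. Here is a toy Python sketch of that
property (illustrative only; nothing here is PostgreSQL code, and the table
data is made up):

```python
from itertools import product

def cross_join_key_is_unique(rows1, key1, rows2, key2):
    """Return True if concatenating each side's key columns (given as
    column indexes) is a unique key over the cross join of rows1/rows2."""
    joined = [tuple(r1[i] for i in key1) + tuple(r2[i] for i in key2)
              for r1, r2 in product(rows1, rows2)]
    return len(joined) == len(set(joined))

# (pk, b) rows: pk is unique on each side, b is not
t100 = [(1, 7), (2, 7), (3, 8)]
t101 = [(10, 5), (20, 5)]

print(cross_join_key_is_unique(t100, [0], t101, [0]))  # (t100.pk, t101.pk): True
print(cross_join_key_is_unique(t100, [1], t101, [0]))  # (t100.b, t101.pk): False
```

This also hints at why the list can grow quickly: every pair drawn from the
two sides' key lists qualifies, which is where a reltarget-based filter
would help.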
A small question about the following code:

+ if (relation_has_uniquekeys_for(root, input_rel,
+                                 get_sortgrouplist_exprs(parse->distinctClause,
+                                                         parse->targetList),
+                                 false))
+ {
+     add_path(distinct_rel, (Path *) cheapest_input_path);
+
+     /* XXX yeah yeah, need to call the hooks etc. */
+
+     /* Now choose the best path(s) */
+     set_cheapest(distinct_rel);
+
+     return distinct_rel;
+ }

Since we don't create a new RelOptInfo or Path here, do we still need to
call add_path and set_cheapest?

Best Regards
Andy Fan