Hi David:

On Thu, Mar 12, 2020 at 3:51 PM David Rowley <dgrowle...@gmail.com> wrote:
> On Wed, 11 Mar 2020 at 17:23, Andy Fan <zhihui.fan1...@gmail.com> wrote:
> > Now I am convinced that we should maintain UniquePath on RelOptInfo,
> > I would see how to work with "Index Skip Scan" patch.
>
> I've attached a very early proof of concept patch for unique keys.
> The NULL detection stuff is not yet hooked up, so it'll currently do
> the wrong thing for NULLable columns. I've left some code in there
> with my current idea of how to handle that, but I'll need to add more
> code both to look at the catalogue tables to see if there's a NOT NULL
> constraint and also to check for strict quals that filter out NULLs.
>
> Additionally, I've not hooked up the collation checking stuff yet. I
> just wanted to see if it would work ok for non-collatable types first.
>
> I've added a couple of lines to create_distinct_paths() to check if
> the input_rel has the required UniqueKeys to skip doing the DISTINCT.
> It seems to work, but my tests so far are limited to:
>
> create table t1(a int primary key, b int);
> create table t2(a int primary key, b int);
>
> postgres=# -- t2 could duplicate t1, don't remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.b;
>             QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t1.a
>    ->  Hash Join
>          Hash Cond: (t2.b = t1.a)
>          ->  Seq Scan on t2
>          ->  Hash
>                ->  Seq Scan on t1
> (7 rows)
>
> postgres=# -- neither rel can duplicate the other due to join on PK.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.a = t2.a;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t1.a = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t2.a cannot duplicate t1 and t1.a is unique.
> Remove DISTINCT
> postgres=# explain (costs off) select distinct t1.a from t1 inner join
> t2 on t1.b = t2.a;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t1.b = t2.a)
>    ->  Seq Scan on t1
>    ->  Hash
>          ->  Seq Scan on t2
> (5 rows)
>
> postgres=# -- t1.b can duplicate t2.a. Don't remove DISTINCT
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.b = t2.a;
>             QUERY PLAN
> ----------------------------------
>  HashAggregate
>    Group Key: t2.a
>    ->  Hash Join
>          Hash Cond: (t1.b = t2.a)
>          ->  Seq Scan on t1
>          ->  Hash
>                ->  Seq Scan on t2
> (7 rows)
>
> postgres=# -- t1.a cannot duplicate t2.a. Remove DISTINCT.
> postgres=# explain (costs off) select distinct t2.a from t1 inner join
> t2 on t1.a = t2.b;
>          QUERY PLAN
> ----------------------------
>  Hash Join
>    Hash Cond: (t2.b = t1.a)
>    ->  Seq Scan on t2
>    ->  Hash
>          ->  Seq Scan on t1
> (5 rows)
>
> I've also left a bunch of XXX comments for things that I know need more
> thought.
>
> I believe we can propagate the joinrel's unique keys where the patch
> is currently doing it. I understand that in
> populate_joinrel_with_paths() we do things like swapping LEFT JOINs
> for RIGHT JOINs and switch the input rels around, but we do so only
> because it's equivalent, so I don't currently see why we can't take
> the jointype for the SpecialJoinInfo. I need to know that as I'll need
> to ignore pushed down RestrictInfos for outer joins.
>
> I'm posting now as I know I've been mentioning this UniqueKeys idea
> for quite a while and if it's not something that's going to get off
> the ground, then it's better to figure that out now.

Thanks for the code! Here are some points from me.

1. For populate_baserel_uniquekeys, we need to handle the "pk = Const"
case as well (relation_has_unique_index_for has similar logic). Currently
the following distinct path is still there.
postgres=# explain select distinct b from t100 where pk = 1;
                                    QUERY PLAN
----------------------------------------------------------------------------------
 Unique  (cost=8.18..8.19 rows=1 width=4)
   ->  Sort  (cost=8.18..8.19 rows=1 width=4)
         Sort Key: b
         ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1 width=4)
               Index Cond: (pk = 1)
(5 rows)

I think in this case we can add both (pk) and (b) as UniqueKeys. If so,
we get more opportunities to reach our goal.

2. As for propagate_unique_keys_to_joinrel, we can add one more UniqueKey,
(rel1_unique_keys, rel2_unique_keys), when the current rules don't apply;
otherwise the following case can't be handled.

postgres=# explain select distinct t100.pk, t101.pk from t100, t101;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
   ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
         Sort Key: t100.pk, t101.pk
         ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
               ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
               ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
                     ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260 width=4)
(7 rows)

But if we add such a rule, the unique key list probably becomes much
longer, so we need a strategy to tell whether a UniqueKey is useful for
the query; if it is not, we can ignore it. rel->reltarget may be good
information for such an optimization. I think we can take this into
consideration for populate_baserel_uniquekeys as well.

For the non-null info, Tom suggested maintaining such info in RelOptInfo.
I have done that for the not-null info from the basic relation catalog; I
think we can maintain the same flag for joinrels, and also use the
not-null info from find_nonnullable_vars, but I still haven't found a good
place to add that so far.
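The rule in point 2 can be checked outside the planner: if each side of a
join has a unique key, concatenating the two keys yields a unique key over
the cross join of the two inputs. Here is a toy Python sketch of that
property (illustrative only; nothing here is PostgreSQL code, and the table
data is made up):

```python
from itertools import product

def cross_join_key_is_unique(rows1, key1, rows2, key2):
    """Return True if concatenating each side's key columns (given as
    column indexes) is a unique key over the cross join of rows1/rows2."""
    joined = [tuple(r1[i] for i in key1) + tuple(r2[i] for i in key2)
              for r1, r2 in product(rows1, rows2)]
    return len(joined) == len(set(joined))

# (pk, b) rows: pk is unique on each side, b is not
t100 = [(1, 7), (2, 7), (3, 8)]
t101 = [(10, 5), (20, 5)]

print(cross_join_key_is_unique(t100, [0], t101, [0]))  # (t100.pk, t101.pk): True
print(cross_join_key_is_unique(t100, [1], t101, [0]))  # (t100.b, t101.pk): False
```

This also hints at why the list can grow quickly: every pair drawn from the
two sides' key lists qualifies, which is where a reltarget-based filter
would help.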
A small question about the following code:

+ if (relation_has_uniquekeys_for(root, input_rel,
+                                 get_sortgrouplist_exprs(parse->distinctClause,
+                                                         parse->targetList),
+                                 false))
+ {
+     add_path(distinct_rel, (Path *) cheapest_input_path);
+
+     /* XXX yeah yeah, need to call the hooks etc. */
+
+     /* Now choose the best path(s) */
+     set_cheapest(distinct_rel);
+
+     return distinct_rel;
+ }

Since we don't create a new RelOptInfo or Path here, do we still need to
call add_path and set_cheapest?

Best Regards
Andy Fan