Re: [HACKERS] Performance improvement for joins where outer side is unique

David Rowley Mon, 13 Mar 2017 15:36:41 -0700

On 14 March 2017 at 07:50, Tom Lane <t...@sss.pgh.pa.us> wrote:

> [ getting back to this patch finally... ]
>
> David Rowley <david.row...@2ndquadrant.com> writes:
> > I've attached a patch which implements this, though only for
> > MergeJoin, else I'd imagine we'd also need to ensure all proofs used
> > for testing the uniqueness were also hash-able too. I added some XXX
> > comments in analyzejoin.c around the mergeopfamilies == NIL tests to
> > mention that Merge Join depends on all the unique proof quals having
> > mergeopfamilies.  This also assumes we'll never use some subset of
> > mergejoin-able quals for a merge join, which could be an interesting
> > area in the future, as we might have some btree index on a subset of
> > those columns to provide pre-sorted input. In short, it's a great
> > optimisation, but I'm a little scared we might break it one day.
>
> Umm ... it's broken already isn't it?  See the comment for
> generate_mergejoin_paths:
>
>  * We generate mergejoins if mergejoin clauses are available.  We have
>  * two ways to generate the inner path for a mergejoin: sort the cheapest
>  * inner path, or use an inner path that is already suitably ordered for
> the
>  * merge.  If we have several mergeclauses, it could be that there is no
> inner
>  * path (or only a very expensive one) for the full list of mergeclauses,
> but
>  * better paths exist if we truncate the mergeclause list (thereby
> discarding
>  * some sort key requirements).  So, we consider truncations of the
>  * mergeclause list as well as the full list.  (Ideally we'd consider all
>  * subsets of the mergeclause list, but that seems way too expensive.)
>
> There's another, more subtle, way in which it could fail in
> sort_inner_and_outer():
>
>  * Each possible ordering of the available mergejoin clauses will generate
>  * a differently-sorted result path at essentially the same cost.  We have
>  * no basis for choosing one over another at this level of joining, but
>  * some sort orders may be more useful than others for higher-level
>  * mergejoins, so it's worth considering multiple orderings.
>  *
>  * Actually, it's not quite true that every mergeclause ordering will
>  * generate a different path order, because some of the clauses may be
>  * partially redundant (refer to the same EquivalenceClasses).  Therefore,
>  * what we do is convert the mergeclause list to a list of canonical
>  * pathkeys, and then consider different orderings of the pathkeys.
>
> I'm fairly sure that it's possible to end up with fewer pathkeys than
> there are mergeclauses in this code path.  Now, you might be all right
> anyway given that the mergeclauses must refer to the same ECs in such a
> case --- maybe they're fully redundant and we can take testing the
> included clause as proving the omitted one(s) too.  I'm not certain
> right now what I meant by "partially redundant" in this comment.
> But in any case, it's moot for the present purpose because
> generate_mergejoin_paths certainly breaks your assumption.
>


Thanks for looking at this again.

Yeah confirmed. It's broken. I guess I just need to remember in the Path if
we got all the join quals, although I've not looked in detail what the best
fix is. I imagine it'll require storing something else in the JoinPath.

Here's my test case:

select 'create tablespace slow_disk LOCATION ''' ||
current_setting('data_directory') || ''' WITH (random_page_cost=1000);';
\gexec
create table ab (a int not null, b int not null);
create unique index ab_pkey on ab (a,b) tablespace slow_disk;
alter table ab add constraint ab_pkey primary key using index ab_pkey;

create index ab_a_idx on ab (a);

insert into ab values(1,1);
insert into ab values(1,2);

set enable_nestloop to 0;
set enable_hashjoin to 0;
set enable_sort to 0;

explain verbose select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and
ab1.b = ab2.b;
                                     QUERY PLAN

-------------------------------------------------------------------------------------
 Merge Join  (cost=0.26..24.35 rows=1 width=16)
   Output: ab1.a, ab1.b, ab2.a, ab2.b
   Inner Unique: Yes
   Merge Cond: (ab1.a = ab2.a)
   Join Filter: (ab1.b = ab2.b)
   ->  Index Scan using ab_a_idx on public.ab ab1  (cost=0.13..12.16 rows=2
width=8)
         Output: ab1.a, ab1.b
   ->  Index Scan using ab_a_idx on public.ab ab2  (cost=0.13..12.16 rows=2
width=8)
         Output: ab2.a, ab2.b
(9 rows)

select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and ab1.b = ab2.b;
-- wrong results
 a | b | a | b
---+---+---+---
 1 | 2 | 1 | 2
(1 row)

drop index ab_a_idx;

-- same query again. This time it'll use the PK index on the slow_disk
tablespace
select * from ab ab1 inner join ab ab2 on ab1.a = ab2.a and ab1.b = ab2.b;
 a | b | a | b
---+---+---+---
 1 | 1 | 1 | 1
 1 | 2 | 1 | 2
(2 rows)

I had to fix up some conflicts before testing this again on a recent
master, so I've attached my rebased version. No fix yet for the above issue


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

unique_joins_2017-03-14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance improvement for joins where outer side is unique

Reply via email to