Re: [HACKERS] Performance improvement for joins where outer side is unique

David Rowley Fri, 27 Jan 2017 05:22:43 -0800

On 27 January 2017 at 12:39, Tom Lane <[email protected]> wrote:
> 2. In these same cases (unique/semi/anti joins), it is possible to avoid
> mark/restore overhead in a mergejoin, because we can tweak the executor
> logic to not require backing up the inner side.  This goes further than
> just tweaking the executor logic, though, because if we know we don't
> need mark/restore then that actually makes some plan shapes legal that
> weren't before: we don't need to interpose a Material node to protect
> join inputs that can't mark/restore.


I've made modifications in the attached to add this optimization, and
it's quite a significant improvement.

Setup:

create table t1 (id text primary key);
insert into t1 select x from generate_series(1,1000000) x(x);
create table t2 (id text primary key);
insert into t2 select x from generate_series(1,1000000) x(x);
vacuum freeze;

Query:
select count(*) from t1 inner join t2 on t1.id=t2.id;

Unpatched

Time: 369.618 ms
Time: 358.208 ms
Time: 357.094 ms
Time: 355.642 ms
Time: 358.193 ms
Time: 371.272 ms
Time: 354.386 ms
Time: 364.277 ms
Time: 346.091 ms
Time: 358.757 ms

Patched

Time: 273.154 ms
Time: 258.520 ms
Time: 269.456 ms
Time: 252.861 ms
Time: 271.015 ms
Time: 252.567 ms
Time: 271.132 ms
Time: 267.505 ms
Time: 265.295 ms
Time: 257.068 ms

About 26.5% improvement for this case.

> * The patch applies point #1 to only INNER and LEFT join modes, but
> I don't really see why the idea wouldn't work for RIGHT and FULL modes,
> ie the optimization seems potentially interesting for all executable join
> types.  Once you've got a match, you can immediately go to the next outer
> tuple instead of continuing to scan inner.  (Am I missing something?)

No I believe I started development with LEFT JOINs, moved to INNER,
then didn't progress to think of RIGHT or FULL. I've lifted this
restriction in the patch. It seems no other work is required to have
that just work.

> * Particularly in view of the preceding point, I'm not that happy with
> the way that management/caching of the "is it unique" knowledge is
> done completely differently for INNER and LEFT joins.  I wonder if
> there's actually a good argument for that or is it mostly a development
> sequence artifact.  IOW, would it hurt to drop the SpecialJoinInfo
> tie-in and just rely on the generic cache?

I agree that special handling of one join type is not so pretty.
However, LEFT JOINs still remain a bit special as they're the only
ones we currently perform join removal on, and the patch modifies that
code to make use of the new flag for those. This can improve planner
performance of join removal when a join is removed successfully, as
the previous code had to recheck uniqueness of each remaining LEFT
JOIN again, whereas the new code only checks uniqueness of ones not
previously marked as unique. This too likely could be done with the
cache, although I'm a bit concerned with populating the cache, then
performing a bunch of LEFT JOIN removals and leaving relids in the
cache which no longer exist. Perhaps it's OK. I've just not found
proofs in my head yet that it is.

> * Because of where we apply the short-circuit logic in the executor,
> it's only safe to consider the inner rel as unique if it is provably
> unique using only the join clauses that drive the primary join mechanism
> (ie, the "joinquals" not the "otherquals").  We already do ignore quals
> that are pushed-down to an outer join, so that's good, but there's an
> oversight: we will use a qual that is mergejoinable even if it's not
> hashjoinable. That means we could get the wrong answers in a hash join.
> I think probably the appropriate fix for the moment is just to consider
> only clauses that are both mergeable and hashable while trying to prove
> uniqueness.  We do have some equality operators that support only one
> or the other, but they're corner cases, and I'm dubious that it's worth
> having to make separate proofs for merge and hash joins in order to
> cater to those cases.

hmm. I'm having trouble understanding why this is a problem for Unique
joins, but not for join removal?

However you mentioning this cause me to notice that this is only true
in the patch for left joins, and the other join types are not
consistent with that.

create table a1 (a int, b int, primary key(a,b));
create table a2 (a int, b int, primary key(a,b));

explain (verbose, costs off) select * from a1 left join a2 on a1.a =
a2.a and a2.b=1 ;
                    QUERY PLAN
--------------------------------------------------
 Merge Left Join
   Output: a1.a, a1.b, a2.a, a2.b
   Inner Unique: Yes
   Merge Cond: (a1.a = a2.a)
   ->  Index Only Scan using a1_pkey on public.a1
         Output: a1.a, a1.b
   ->  Sort
         Output: a2.a, a2.b
         Sort Key: a2.a
         ->  Bitmap Heap Scan on public.a2
               Output: a2.a, a2.b
               Recheck Cond: (a2.b = 1)
               ->  Bitmap Index Scan on a2_pkey
                     Index Cond: (a2.b = 1)


explain (verbose, costs off) select * from a1 inner join a2 on a1.a =
a2.a and a2.b=1 ;
                    QUERY PLAN
--------------------------------------------------
 Nested Loop
   Output: a1.a, a1.b, a2.a, a2.b
   Inner Unique: No
   ->  Bitmap Heap Scan on public.a2
         Output: a2.a, a2.b
         Recheck Cond: (a2.b = 1)
         ->  Bitmap Index Scan on a2_pkey
               Index Cond: (a2.b = 1)
   ->  Index Only Scan using a1_pkey on public.a1
         Output: a1.a, a1.b
         Index Cond: (a1.a = a2.a)

Notice the inner join is not detected as unique, but the left join is.

This is still not right in the attached. I'm not quite sure what to do
about it yet.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

unique_joins_2017-01-28.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Performance improvement for joins where outer side is unique

Reply via email to