Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat Mon, 24 Apr 2017 04:06:51 -0700

On Sat, Apr 22, 2017 at 3:52 AM, Robert Haas <[email protected]> wrote:
> On Fri, Apr 21, 2017 at 8:41 AM, Ashutosh Bapat
> <[email protected]> wrote:
>> I don't see how is that fixed. For a join relation we need to come up
>> with one set of partition bounds by merging partition bounds of the
>> joining relation and in order to understand how to interpret the
>> datums in the partition bounds, we need to associate data types. The
>> question is which data type we should use if the relations being
>> joined have different data types associated with their respective
>> partition bounds.
>>
>> Or are you saying that we don't need to associate data type with
>> merged partition bounds? In that case, I don't know how do we compare
>> the partition bounds of two relations?
>
> Well, since there is no guarantee that a datatype exists which can be
> used to "merge" the partition bounds in the sense that you are
> describing, and even if there is one we have no opfamily
> infrastructure to find out which one it is, I think it would be smart
> to try to set things up so that we don't need to do that.  I believe
> that's probably possible.
>
>> In your example, A has partition key of type int8, has bound datums
>> X1.. X10. B has partition key of type int4 and has bounds datums X1 ..
>> X11. C has partition key type int2 and bound datums X1 .. X12.
>
> OK, sure.
>
>> The binary representation of X's is going to differ between A, B and C
>> although each Xk for A, B and C is equal, wherever exists.
>
> Agreed.
>
>> Join
>> between A and B will have merged bound datums X1 .. X10 (and X11
>> depending upon the join type). In order to match bounds of AB with C,
>> we need to know the data type of bounds of AB, so that we can choose
>> appropriate equality operator. The question is what should we choose
>> as data type of partition bounds of AB, int8 or int4. This is
>> different from applying join conditions between AB and C, which can
>> choose the right opfamily operator based on the join conditions.
>
> Well, the join is actually being performed either on A.keycol =
> C.keycol or on B.keycol = C.keycol, right?  It has to be one or the
> other; there's no "merged" join column in any relation's targetlist,
> but only columns derived from the various baserels.  So let's use that
> set of bounds for the matching.  It makes sense to use the set of
> bounds for the matching that corresponds to the column actually being
> joined, I think.
>
> It's late here and I'm tired, but it seems like it should be possible
> to relate the child joinrels of the AB join back to the child joinrels
> of either A or B.  (AB)1 .. (AB)10 related back to A1 .. A10 and B1 ..
> B10.  (AB)11 relates back to B11 but, of course not to A11, which
> doesn't exist.  If the join is INNER, (AB)11 is a dummy rel anyway and
> actually we should probably see whether we can omit it altogether.  If
> the join is an outer join of some kind, there's an interesting case
> where the user wrote A LEFT JOIN B or B RIGHT JOIN A so that A is not
> on the nullable side of the join; in that case, too, (AB)11 is dummy
> or nonexistent.  Otherwise, assuming A is nullable, (AB)11 maps only
> to B11 and not to A11.  But that's absolutely right: if the join to C
> uses A.keycol, either the join operator is strict and (AB)11 won't
> match anything anyway, or it's not and partition-wise join is illegal
> because A.keycol in (AB)11 can include not only values from X11 but
> also nulls.
>
> So, it seems to me that what you can do is loop over the childrels on
> the outer side of the join.  For each one, you've got a join clause
> that relates the outer rel to the inner rel, and that join clause
> mentions some baserel which is contained in the joinrel.  So drill
> down through the childrel to the corresponding partition of the
> baserel and get those bounds.  Then if you do the same thing for the
> inner childrels, you've now got two lists of bounds, and the type on
> the left matches the outer side of the join and the type on the right
> matches the inner side of the join and the opfamily of the operator in
> the join clause gives you a comparison operator that relates those two
> types, and now you can match them up.


This assumes that datums in partition bounds have one to one mapping
with the partitions, which isn't true for list partitions. For list
partitions we have multiple datums corresponding to the items listed
associated with a given partition. So, simply looping over the
partitions of outer relations doesn't work; in fact there are two
outer relations for a full outer join, so we have to loop over both of
them together in a merge join fashion.

Consider A join B where A has partitions A1 (a, b, c), A2(e, f), A3(g,
h) and B has partitions B1 (a, b), B2 (c, d, e), B3(f, g, h). If we
just look at the partitions, we won't recognize that list item c is
repeated in A1B1 and A2B2. That can be recognized only when we loop
over the datums of A and B trying to match the partitions. We will see
that for a, b A1 and B1 match but for c A1 and B1 do not match,
instead A1 and B2 match. In one to one partition matching we will bail
out here.

I think, we have to find the base relations whose partition bounds
should be used for comparison looking at the equi-join conditions and
then compare those partition bounds to come up with the partition
bounds of join relation. That won't work straight forward either when
their are partitions missing on either sides of the join, I guess.
Needs a careful thought.

>
> (We should also keep in mind the case where there are multiple columns
> in the partition key.)

Yes. This is tricky. Consider A partitioned by (a1, a2) B partitioned
by (b1, b2) and C partitioned by (c1, c2). If the query is A join B on
(A.a1 = B.a1 and A.a2 = B.b2) join C on (C.c1 = A.a1 and C.c2 = B.b2),
we need to fetch partition bound values for a1 from A's partition
bounds and those for b1 from B's partition bounds. Create combined
partition bounds from those and then compare the combined bounds with
those of C.

After saying all that, I think we have a precedence of merged join
columns with merged data types. Consider
create table t1(a int2, b int);
 create table t2 (a int4, b int);
explain verbose select * from t1 join t2 using(a);
                               QUERY PLAN
-------------------------------------------------------------------------
 Merge Join  (cost=327.25..745.35 rows=27120 width=12)
   Output: t2.a, t1.b, t2.b
   Merge Cond: (t2.a = t1.a)
   ->  Sort  (cost=158.51..164.16 rows=2260 width=8)
         Output: t2.a, t2.b
         Sort Key: t2.a
         ->  Seq Scan on public.t2  (cost=0.00..32.60 rows=2260 width=8)
               Output: t2.a, t2.b
   ->  Sort  (cost=168.75..174.75 rows=2400 width=6)
         Output: t1.b, t1.a
         Sort Key: t1.a
         ->  Seq Scan on public.t1  (cost=0.00..34.00 rows=2400 width=6)
               Output: t1.b, t1.a
(13 rows)

When using clause is used the columns specified by using clause from
the joining relations are merged into a single column. Here it has
used a "wider" type column t2.a as the merged column for t1.a and
t2.a. The logic is in buildMergedJoinVar().

Probably we want to build merged partition bounds for a join relation
where partition keys of the joining relations are different using a
single data type provided by the same logic as buildMergedJoinVar()
and attach those to the join relation.

[1] http://www.mail-archive.com/[email protected]/msg312629.html

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Reply via email to