Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Ashutosh Bapat Thu, 27 Apr 2017 00:42:07 -0700

On Wed, Apr 26, 2017 at 9:30 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Mon, Apr 24, 2017 at 7:06 AM, Ashutosh Bapat
> <ashutosh.ba...@enterprisedb.com> wrote:
>> This assumes that datums in partition bounds have one to one mapping
>> with the partitions, which isn't true for list partitions. For list
>> partitions we have multiple datums corresponding to the items listed
>> associated with a given partition. So, simply looping over the
>> partitions of outer relations doesn't work; in fact there are two
>> outer relations for a full outer join, so we have to loop over both of
>> them together in a merge join fashion.
>
> Maybe so, but my point is that it can be done with the original types,
> without converting anything to a different type.
>


Theoretically, I agree with this. But practically the implementation
is lot more complex than what you have described in the earlier mails.
I am afraid, that the patch with those changes will be a lot harder to
review and commit. Later in this mail, I will try to explain some of
the complexities.

>
> I'm going to say this one more time: I really, really, really think
> you need to avoid trying to convert the partition bounds to a common
> type.  I said before that the infrastructure to do that is not present
> in our type system, and I'm pretty sure that statement is 100%
> correct.  The fact that you can find other cases where we do something
> sorta like that but in a different case with different requirements
> doesn't make that false.

Ok. Thanks for the explanation.

The current design and implementation is for a restricted case where
the partition bounds, partition key types and numbers match exactly.
We want to commit an implementation which is reasonably extensible and
doesn't require a lot of changes when we add more capabilities. Some
of the extensions we discussed are as follows:
1. Partition-wise join when the existing partitions have matching
bounds/lists but there can be extra partitions on either side of the
join (between base relations or join relations) without a matching
partition on the other side.\
2. Partition-wise join when the partition bounds/lists do not match
exactly but there is 1:1 or 1:0 or 0:1 mapping between the partitions
which can contribute to the final result. E.g. A (0-100, 100 - 150,
200-300), B (0-50, 125-200, 300-400)
3. Partition-wise join when the partition key types do not match, but
there's a single opfamily being used for partitioning.
4. Partition-wise join where 1:m or m:n mapping exists between
partitions of the joining relations.


First one is clearly something that we will need. We may add it in the
first commit or next commit, but it will be needed pretty soon (v11?).
To me 2nd is more important than the 3rd one. You may have a different
view. We will expect 3rd optimization to work with all the prior
optimizations. I am restricting myself from thinking about 4th one
since that requires ganging together multiple RelOptInfos as a single
RelOptInfo while joining, something we don't have infrastruture for.

In case of first goal, supporting INNER joins and OUTER joins where
the partitions are missing on the OUTER side but not inner side are
easier. In those cases we just drop those partitions and corresponding
bounds/lists from the join. For a FULL OUTER join, where both sides
act as OUTER as well as INNER, we will need exact mapping between the
partitions. For supporting OUTER joins where partitions on the INNER
sides can be missing, we need to create some "dummy" relations
representing the missing partitions so that we have OUTER rows with
NULL inner side. This requires giving those dummy relations some
relids and thus in case of base relations we may need to inject some
dummy children. This may mean that we have to expand simple_rel_array
as part of outer join, may or may not require adding new
AppendRelInfos and so on. We are basically breaking an assumption that
base relations can not be introduced while planning joins and that
might require some rework in the existing infrastructure. There might
be other ways to introduce dummy relations during join planning, but I
haven't really thought through the problem.

The third goal requires that the partition bounds be compared based on
the partition keys present in the equi-join. While matching the
partitions to be joined, the partition bounds corresponding the base
relation whose partition keys appear in the equi-join are used for
comparison using support function corresponding to the data types of
partition keys. This requires us to associate the partitions of a join
with the bounds of base relation. E.g. A(A1, A2) with bounds (X1, X3)
(notice missing X2), B (B1, B2) bounds (X1, X2), C (C1, C2, C3) bounds
(X1, X2, X3) and the join is A LJ B on A.a = B.b LJ C on B.b = C.c
assuming strict operators this can be executed as (AB)C or A(BC). AB
will have partitions A1B1, A2B3 since there is no matching bound of A
for B2 and A is outer relation. A1B1 is associated with bound X1 of A
and C both. A2B3 is associated with bound of X3, which happens to be
2nd bound of A but third of B. When we join (AB) with C, we should
notice that C1 goes with A1B1, C2 doesn't have any matching partition
in AB and C3 goes with A2B3. If we compare bounds of B with C without
any transformation we will know C2 matches B2, but we need to look at
the children of AB to realize that B2 isn't present in any of the
children and thus C2 should not be joined with any partition of AB.
That usually looks a quadratic order operation on the number of
partitions. The complexity can be reduced by maintaining as many
partition bounds as the number of base relations participating in the
join (an idea, I have floated earlier [1]) I don't elaborate it here
to avoid digression. There's also the complexity of an N-way join with
multiple partition keys and joins on partition keys from different
relations as discussed in [1]. There may be more involved cases, that
I haven't thought about. In short, implementation for 1st and 3rd
optimization together looks fairly complex.

Add to this the 2nd optimization and it becomes still more complex.

In order to keep the patches manageable to implement review and
commit, I am proposing following approach.

1. Implement first optimization on top of the current patches, which
enforces that the partition key datatypes of the joining relations
match. I am right now working on that patch. Do this for INNER join
and OUTER join where partitions are missing on the OUTER side and not
INNER side.
As a side note, the existing partition bound comparison functions are
tied to PartitionKey structure and require complete set of bounds from
partitioned relation. Both of those are not applicable anymore,
PartitionKey structure is not available for join and we have to
compare individual bounds in case of join as against one probe with a
complete set. This refactoring did eat some time.

2. Implement support for OUTER join where partitions can be missing
from either side.

3. Implement support for partition-wise join with different partition key types.

All those implementation will be different patches on top of v18 patches.

Given the complexities involved in 2 and 3, I am not sure which order
I should attack them. I don't have any estimates as to how much time
each of those are going to require. May be a couple of months, but I
am not sure.

Obviously we have to wait till the first commitfest to commit the
first version of the patch. So, based on the status at time, we can
decide what goes in the first commit of this feature and adjust the
patch set accordingly.

Thoughts/comments?

[1] http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg312916.html

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Partition-wise join for join between (declaratively) partitioned tables

Reply via email to