Re: [HACKERS] Declarative partitioning - another take

Ashutosh Bapat Wed, 23 Nov 2016 21:40:25 -0800

Unsurprisingly ALTER TABLE worked
alter table t1_p1 add primary key(a);
ALTER TABLE
postgres=# \d+ t1_p1
                                   Table "public.t1_p1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats
target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           | not null |         | plain   |              |
 b      | integer |           |          |         | plain   |              |
Indexes:
    "t1_p1_pkey" PRIMARY KEY, btree (a)
Inherits: t1


On Thu, Nov 24, 2016 at 11:05 AM, Ashutosh Bapat
<ashutosh.ba...@enterprisedb.com> wrote:
> I am trying to create a partitioned table with primary keys on the
> partitions. Here's the corresponding syntax as per documentation
> CREATE [ [ GLOBAL | LOCAL ] { TEMPORARY | TEMP } | UNLOGGED ] TABLE [
> IF NOT EXISTS ] table_name
>     PARTITION OF parent_table [ (
>   { column_name WITH OPTIONS [ column_constraint [ ... ] ]
>     | table_constraint }
>     [, ... ]
> ) ] partition_bound_spec
>
> IIUC, it should allow "create table t1_p1 partition of t1 (a primary
> key) ...", (a primary key) is nothing but "column_name
> column_constraint", but here's what happens
> create table t1_p1 partition of t1 (a primary key) for values from (0) to 
> (100);
> ERROR:  syntax error at or near "primary"
> LINE 1: create table t1_p1 partition of t1 (a primary key) for value...
>
> The same syntax also suggests using table_constraints but that too doesn't 
> work
>  create table t1_p1 partition of t1 (primary key (a) )  for values
> from (0) to (100);
> ERROR:  inherited relation "t1" is not a table or foreign table
>
> of course t1 is a table, what it isn't?
>
> Am I missing something? How do I define constraints on the partitions?
>
>
> On Tue, Nov 22, 2016 at 2:45 PM, Amit Langote
> <langote_amit...@lab.ntt.co.jp> wrote:
>>
>> Updated patches attached.  I merged what used to be 0006 and 0007 into one.
>>
>> On 2016/11/19 2:23, Robert Haas wrote:
>>> On Fri, Nov 18, 2016 at 5:59 AM, Amit Langote
>>> <langote_amit...@lab.ntt.co.jp> wrote:
>>>> Oh but wait, that means I can insert rows with NULLs in the range
>>>> partition key if I choose to insert it directly into the partition,
>>>> whereas I have been thinking all this while that there could never be
>>>> NULLs in the partition key of a range partition.  What's more,
>>>> get_qual_for_partbound() (patch 0003) emits a IS NOT NULL constraint for
>>>> every partition key column in case of a range partition.  Is that
>>>> wrongheaded altogether?  (also see my reply to your earlier message about
>>>> NULLs in the range partition key)
>>>
>>> The easiest thing to do might be to just enforce that all of the
>>> partition key columns have to be not-null when the range-partitioned
>>> table is defined, and reject any attempt to DROP NOT NULL on them
>>> later.  That's probably better that shoehorning it into the table
>>> constraint.
>>
>> Agreed that forcing range partitioning columns to be NOT NULL during table
>> creation would be a better approach.  But then we would have to reject
>> using expressions in the range partition key, right?
>>
>>>> Thanks for the idea below!
>>>>
>>>>> 1. Forget the idea of a tree.  Instead, let the total number of tables
>>>>> in the partitioning hierarchy be N and let the number of those that
>>
>> [ ... ]
>>
>>>>>
>>>>> No recursion, minimal pointer chasing, no linked lists.  The whole
>>>>> thing is basically trivial aside from the cost of
>>>>> list/range_partition_for_tuple itself; optimizing that is a different
>>>>> project.  I might have some details slightly off here, but hopefully
>>>>> you can see what I'm going for: you want to keep the computation that
>>>>> happens in get_partition_for_tuple() to an absolute minimum, and
>>>>> instead set things up in advance so that getting the partition for a
>>>>> tuple is FAST.  And you want the data structures that you are using in
>>>>> that process to be very compact, hence arrays instead of linked lists.
>>>>
>>>> This sounds *much* better.  Here is a quick attempt at coding the design
>>>> you have outlined above in the attached latest set of patches.
>>>
>>> That shrank both 0006 and 0007 substantially, and it should be faster,
>>> too.   I bet you can shrink them further:
>>
>> Some changes described below have reduced the size to a certain degree.
>>
>>>
>>> - Why is PartitionKeyExecInfo a separate structure and why does it
>>> have a NodeTag?  I bet you can dump the node tag, merge it into
>>> PartitionDispatch, and save some more code and some more
>>> pointer-chasing.
>>
>> OK, I merged the fields of what used to be PartitionKeyExecInfo into
>> PartitionDispatchData as the latter's new fields key and keystate.
>>
>>> - I still think it's a seriously bad idea for list partitioning and
>>> range partitioning to need different code-paths all over the place
>>> here. List partitions support nulls but not multi-column partitioning
>>> keys and range partitions support multi-column partitioning keys but
>>> not nulls, but you could use an internal structure that supports both.
>>> Then you wouldn't need partition_list_values_bsearch and also
>>> partition_rbound_bsearch; you could have one kind of bound structure
>>> that can be bsearch'd for either list or range.  You might even be
>>> able to unify list_partition_for_tuple and range_partition_for_tuple
>>> although that looks a little harder.  In either case, you bsearch for
>>> the greatest value <= the value you have.  The only difference is that
>>> for list partitioning, you have to enforce at the end that it is an
>>> equal value, whereas for range partitioning less-than-or-equal-to is
>>> enough.  But you should still be able to arrange for more code
>>> sharing.
>>
>> I have considered these suggestions in the latest patch.  Now instead of
>> PartitionListInfo, PartitionRangeInfo, and BoundCollectionData structs,
>> there is only one PartitionBoundInfo which consolidates the partition
>> bound information of a partitioned table.  Some of the fields are
>> applicable only to one of list or range case; for example, null-accepting
>> list partition index, infinite status of individual range datums.
>>
>> Also, there is now only one binary search function named
>> partition_bound_bsearch() which invokes a comparison function named
>> partition_bound_cmp().  The former searches a probe (a partition bound or
>> tuple) within a PartitionBoundInfo, which is passed all the way down to
>> the comparison function.
>>
>> Also, we no longer have list_partition_for_tuple() and
>> range_partition_for_tuple().  Instead, in get_partition_for_tuple()
>> itself, there is a bsearch followed by list and range partitioning
>> specific steps based on the returned offset.
>>
>>> - I don't see why you need the bound->lower stuff any more.  If
>>> rangeinfo.bounds[offset] is a lower bound for a partition, then
>>> rangeinfo.bounds[offset+1] is either (a) the upper bound for that
>>> partition and the partition is followed by a "gap" or (b) both the
>>> upper bound for that partition and the lower bound for the next
>>> partition.  With the inclusive/exclusive bound stuff gone, every range
>>> bound has the same sense: if the probed value is <= the bound then
>>> we're supposed to be a lower-numbered partition, but if > then we're
>>> supposed to be in this partition or a higher-numbered one.
>>
>> OK, I've managed to get rid of lower.  At least it is no longer kept in
>> the new relcache struct PartitionBoundInfo.  It is still kept in
>> PartitionRangeBound which is used to hold individual range bounds when
>> sorting them (during relcache build).  Comparisons invoked during the
>> aforementioned sorting step still need to distinguish between lower and
>> upper bounds (such that '1)' < '[1').
>>
>> Tuple-routing no longer needs to look at lower.  In that case, what you
>> described above applies.
>>
>> As a result, one change became necessary: to how we flag individual range
>> bound datum as infinite or not.  Previously, it was a regular Boolean
>> value (either infinite or not) and to distinguish +infinity from
>> -infinity, we looked at whether the bound is lower or upper (the lower
>> flag).  Now, instead, the variable holding the status of individual range
>> bound datum is set to a ternary value: RANGE_DATUM_FINITE (0),
>> RANGE_DATUM_NEG_INF (1), and RANGE_DATUM_POS_INF (2), which still fits in
>> a bool.  Upon encountering an infinite range bound datum, whether it's
>> negative or positive infinity derives the comparison result.  Consider the
>> following example:
>>
>> partition p1 from (1, unbounded) to (1, 1);
>> partition p2 from (1, 1) to (1, 10);
>> partition p3 from (1, 10) to (1, unbounded);
>> partition p4 from (2, unbounded) to (2, 1);
>> ... so on
>>
>> In this case, we need to be able to conclude, say, (1, -inf) < (1, 15) <
>> (1, +inf), so that tuple (1, 15) is assigned to the proper partition.
>>
>> Does this last thing sound reasonable?
>>
>> Thanks,
>> Amit
>
>
>
> --
> Best Wishes,
> Ashutosh Bapat
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Declarative partitioning - another take

Reply via email to