Re: [HACKERS] Declarative partitioning - another take

Robert Haas Tue, 04 Oct 2016 18:51:07 -0700

On Tue, Oct 4, 2016 at 4:02 AM, Amit Langote
<langote_amit...@lab.ntt.co.jp> wrote:
> [ latest patch set ]


Reviewing 0003:

+      This form attaches an existing table (partitioned or otherwise) as

(which might itself be partitioned)

+      partition of the target table.  Partition bound specification must
+      correspond with the partition method and the key of the target table.

The partition bound specification must correspond to the partitioning
method and partitioning key of the target table.

+      The table being attached must have all the columns of the target table
+      with matching types and no more. Also, it must have all the matching

The table to be attached must have all of the same columns as the
target table and no more; moreover, the column types must also match.

+      with matching types and no more. Also, it must have all the matching
+      constraints as the target table.  That includes both <literal>NOT NULL</>
+      and <literal>CHECK</> constraints.  If some <literal>CHECK</literal>
+      constraint of the table being attached is marked <literal>NO
INHERIT</literal>,
+      the command will fail; such constraints must either be dropped or
+      recreated without the <literal>NO INHERIT</literal> mark.

Why all of these requirements?  We could instead perform a scan to
validate that the constraints are met.  I think the way this should
work is:

1. ATTACH PARTITION works whether matching NOT NULL and CHECK
constraints are present or not.

2. If all of the constraints are present, and a validated constraint
matching the implicit partitioning constraint is also present, then
ATTACH PARTITION does not scan the table to validate constraints;
otherwise, it does.

3. NO VALIDATE is not an option.

+      Currently <literal>UNIQUE</literal>, <literal>PRIMARY KEY</literal>, and
+      <literal>FOREIGN KEY</literal> constraints are not considered, but that
+      might change in the future.

Really?  Changing that sounds impractical to me.

+      This form detaches specified partition of the target table.  The
+      detached partition continues to exist as a standalone table with no ties
+      remaining with the target table.

continues to exist as a standalone table, but no longer has any ties
to the table from which it was detached.

+      Note that if a partition being detached is itself a partitioned table,
+      it continues to exist as such.

You don't really need to say this, I think.  All of the properties of
the detached table are retained, not only its partitioning status.
You wouldn't like it if I told you to document "note that if a
partition being detached is unlogged, it will still be unlogged".

    To add the table as a new child of a parent table, you must own the
-   parent table as well.
+   parent table as well.  That applies to both adding the table as a
+   inheritance child of a parent table and attaching a table as partition to
+   the table.

To add the table as a new child of a parent table, or as a new
partition of a partitioned table, you must own the parent table as
well.

+        The name of the table to attach as a new partition to or
detach from this table.

s/to or/or to/

+    <literal>NO VALIDATE</> option is spcified.

Typo, but see comments above about nuking this altogether.

     A recursive <literal>DROP COLUMN</literal> operation will remove a
     descendant table's column only if the descendant does not inherit
     that column from any other parents and never had an independent
-    definition of the column.  A nonrecursive <literal>DROP
+    definition of the column (which always holds if the descendant table
+    is a partition).  A nonrecursive <literal>DROP
     COLUMN</literal> (i.e., <command>ALTER TABLE ONLY ... DROP
     COLUMN</command>) never removes any descendant columns, but
-    instead marks them as independently defined rather than inherited.
+    instead marks them as independently defined rather than inherited,
+    unless the descendant table is a partition.

This is a hairy explanation.  I suggest that the documentation say
this (and that the code implement it): A nonrecursive DROP TABLE
command will fail for a partitioned table, because all partitions of a
table must have the same columns as the partitioning root.

-    that are not marked <literal>NO INHERIT</>.
+    that are not marked <literal>NO INHERIT</> which are unsupported if
+    the table is a partitioned table.

I think you can omit this hunk.

+   If <literal>PARTITION OF</literal> clause is specified then the table is
+   created as a partition of <literal>parent_table</literal> with specified
+   bounds.  However, unlike regular tables, one cannot specify
+   <literal>PARTITION BY</literal> clause which means foreign tables can
+   only be created as leaf partitions.

I'd delete the sentence beginning with "However".

+   Create foreign table <structname>measurement_y2016m07</>, which will be
+   accessed through the server <structname>server_07</>, that is partition
+   of the range partitioned table <structname>measurement</>:

s/, that is/ as a/

+<phrase>and <replaceable
class="PARAMETER">partition_bound_spec</replaceable> is:</phrase>
+
+FOR VALUES { <replaceable class="PARAMETER">list_spec</replaceable> |
<replaceable class="PARAMETER">range_spec</replaceable> }

I think you can inline the definitions of list_spec and range_spec
here instead of making them separate productions, and I think that
would be preferable.

FOR VALUES { IN ( <replaceable
class="PARAMETER">expression</replaceable> [, ...] ) |
START <replaceable class="PARAMETER">lower-bound</replaceable> [
INCLUSIVE | EXCLUSIVE ] END <replaceable
class="PARAMETER">upper-bound</replaceable> [ INCLUSIVE | EXCLUSIVE ]
}

+      parent table (name optionally schema-qualified).

Parenthetical phrase isn't needed.

+      A partition bound specification must be present and must correspond with
+      partition method and key of the parent table.  It is checked using the
+      specification that the new partition does not overlap with any existing
+      partitions of the parent.

The partition bound specification must correspond to the partitioning
method and partitioning key of the parent table, and must not overlap
with any existing partition of that parent.

+      clause, if any.  Defaults and constraints can optionally be specified
+      for each of the inherited columns, which override those in the parent.

Surely not.  You can't "override" an inherited constraint or an
inherited default.  The child may have extra constraints not present
in the parent, and may have different defaults when it is directly
targeted by an insertion, but it can't possibly override the parent
defaults.

+      One can also specify table constraints, in addition to those inherited
+      from the parent.  Note that all subsequent schema modifications to the
+      parent propagate to partition.

The first part of this seems right, but then what's going on with the
reference to constraints in the previous sentence?  (See previous
review comment.) The second sentence I would delete (but see below).

+     <para>
+      Any data row subsequently inserted into the parent table is mapped to
+      and stored in the partition, provided partition key of the row falls
+      within the partition bounds.
+     </para>

How about: Rows inserted into a partitioned table will be
automatically routed to the correct partition.  If no suitable
partition exists, an error will occur.

+     <para>
+      A partition is dropped or truncated when the parent table is dropped or
+      truncated.  Dropping it directly using <literal>DROP TABLE</literal>
+      will fail; it must first be <firstterm>detached</> from the parent.
+      However, truncating a partition directly works.
+     </para>

How about: A partition must have the same column names and types as
the table of which it is a partition.  Therefore, modifications the
column names or types of the partitioned table will automatically
propagate to all children, as will operations such as TRUNCATE which
normally affect a table and all of its inheritance children.  It is
also possible to TRUNCATE a partition individually, just as for an
inheritance child.

Insisting that you can't drop a child without detaching it first seems
wrong to me.  If I already made this comment and you responded to it,
please point me back to whatever you said.  However, my feeling is
this is flat wrong and absolutely must be changed.

-            if (is_no_inherit)
+
+            /* Discard the NO INHERIT flag if the relation is a partition */
+            if (is_no_inherit && !rel->rd_rel->relispartition)

Something about this seems fishy.  In StoreRelCheck(), you disallow
the addition of a NO INHERIT constraint on a partition, but here you
seem happy to accept it and ignore the NO INHERIT property.  That
doesn't seem consistent.  We should adopt a consistent policy about
what to do about such constraints, and I submit that throwing an error
is better than silently changing things, unless you have some reason
for doing that which I'm not seeing.  Anyway, we should have the same
behavior in both cases.

We should also standardize on what value of conislocal and coninhcount
children end up with; presumably the latter should be 1, but I'm not
sure if the former should be true or false.  In either case, anything
that can vary between children probably needs to be dumped, so let's
enforce that it doesn't so we don't have to dump it.  I'm not sure
whether the code here achieves those objectives, though, and note the
comment in the function header about making sure the logic here
matches MergeConstraintsIntoExisting.

I think the overriding principle here should be: If you attach a table
as a partition, it must not be part of a standard inheritance
hierarchy, and it must not be a partition of any other table.  It can,
however, be partitioned itself.  If you later detach a partition, it
ends up as a standalone table with a copy of each constraint it had as
a partition - probably including the implicit partition constraint.
The DBA can drop those constraints if they're not wanted.

I wonder if it's really a good idea for the partition constraints to
be implicit; what is the benefit of leaving those uncatalogued?

+ * Depending on whether the relation in question is list or range
+ * partitioned, one of the fields is set.
+ */
+typedef struct BoundCollectionData
+{
+    struct ListInfo       *listinfo;
+    struct RangeInfo   *rangeinfo;
+} BoundCollectionData;

This seems like an odd design.  First, when you have a pointer to
either of two things, the normal tool for that in C would be a union,
not a struct. Second, in PostgreSQL we typically handle that by making
both of the things nodes and then you can use IsA() or switch on
nodeTag() to figure out what you've got.  Third, the only place this
is used at least in 0003 is as part of PartitionDescData, which only
has 3 members, so if you were going to do it with 2 members, you could
just include these two members directly.  Considering all of the
foregoing, I'd suggest making this a union and including partstrategy
in PartitionDescData.

I think that the names ListInfo and RangeInfo are far too generic for
something that's specific to partitioning.

+/*
+ * Range bound collection - sorted array of ranges of partitions of a range
+ * partitioned table
+ */
+typedef struct RangeInfo
+{
+    struct PartitionRange    **ranges;
+} RangeInfo;
+
+/* One partition's range */
+typedef struct PartitionRange
+{
+    struct PartitionRangeBound    *lower;
+    struct PartitionRangeBound    *upper;
+} PartitionRange;

This representation doesn't seem optimal, because in most cases the
lower bound of one partition will be the upper bound of the next.  I
suggest that you flatten this representation into a single list of
relevant bounds, each flagged as to whether it is exclusive and
whether it is finite; and a list with one more element of bounds.  For
example, suppose the partition bounds are [10, 20), [20, 30), (30,
40), and [50, 60).  You first construct a list of all of the distinct
bounds, flipping inclusive/exclusive for the lefthand bound in each
case.  So you get:

10 EXCLUSIVE
20 EXCLUSIVE
30 EXCLUSIVE
30 INCLUSIVE
40 EXCLUSIVE
50 EXCLUSIVE
60 EXCLUSIVE

When ordering items for this list, if the same item appears twice, the
EXCLUSIVE copy always appears before INCLUSIVE.  When comparing
against an EXCLUSIVE item, we move to the first half of the array if
we are searching for a value strictly less than that value; when
comparing against an INCLUSIVE item, we move to the first half of the
array if we are searching for a value less than or equal to that
value.

This is a list of seven items, so a binary search will return a
position between 0 (less than all items) and 7 (greater than all
items).  So we need a list of 8 partition mappings, which in this case
will look like this: -1, 0, 1, -1, 2, -1, 3, -1.

In this particular example, there are only two adjacent partitions, so
we end up with 7 bounds with this representation vs. 8 with yours, but
in general I think the gains will be bigger.  If you've got 1000
partitions and they're all adjacent, which is likely, you store 1000
bounds instead of 2000 bounds by doing it this way.

+ * Note: This function should be called only when it is known that 'relid'
+ * is a partition.

Why?  How about "Because this function assumes that the relation whose
OID is passed as an argument will have precisely one parent, it should
only been called when it is known that the relation is a partition."

+    /*
+     * Translate vars in the generated expression to have correct attnos.
+     * Note that the vars in my_qual bear attnos dictated by key which carries
+     * physical attnos of the parent.  We must allow for a case where physical
+     * attnos of a partition can be different from the parent.
+     */
+    partexprs_item = list_head(key->partexprs);
+    for (i = 0; i < key->partnatts; i++)
+    {
+        AttrNumber    attno = key->partattrs[i],
+                    new_attno;
+        char       *attname;
+
+        if (attno != 0)
+        {
+            /* Simple column reference */
+            attname = get_attname(RelationGetRelid(parent), attno);
+            new_attno = get_attnum(RelationGetRelid(rel), attname);
+
+            if (new_attno != attno)
+                my_qual = (List *) translate_var_attno((Node *) my_qual,
+                                                       attno,
+                                                       new_attno);

It can't really be safe to do this one attribute number at a time, or
even if by some good fortune it can't be broken, it at least it seems
extremely fragile.  Suppose that you translate 0 -> 3 and then 3 -> 5;
now the result is garbage.  It's not very efficient to do this one
attno at a time, either.

+    if (classform->relispartition)
+        ereport(ERROR,
+                (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+                 errmsg("\"%s\" is a partition of \"%s\"", rel->relname,
+                        get_rel_name(get_partition_parent(relOid))),
+                 errhint("Use ALTER TABLE DETACH PARTITION to be able
to drop it.")));
+

RangeVarCallbackForDropRelation should do only minimal sanity checks;
defer this until after we have a relation lock.

I didn't get all the way through this patch, so this is a pretty
incomplete review, but it's late here and I'm out of steam for
tonight.  Some general comments:

1. I think that this patch seems to introduce an awful lot of new
structures with confusingly similar names and purposes:
PartitionBoundList, PartitionBoundRange, ListInfo, RangeInfo,
PartitionRange, PartitionList, ListValue, RangePartition.  You've got
4 different structures here that all have "Partition" and "Range" in
the name someplace, including both PartitionRange and RangePartition.
Maybe there's no way around that kind of duplication; after all there
are quite a few moving parts here.  But it seems like it would be good
to try to simplify it.

2. I'm also a bit concerned about the fairly large amount of
apparently-boilerplate code in partition.c, all having to do with how
we create all of these data structures and translate between different
forms of them.  I haven't understood that stuff thoroughly enough to
have a specific idea about how we might be able to get rid of any of
it, and maybe there's no way.  But that too seems like a topic for
futher investigation.  One idea is that maybe some of these things
should be nodes that piggyback on the existing infrastructure in
src/backend/nodes instead of inventing a new way to do something
similar.

3. There are a lot of places where you have separate code for the
range and list partitioning cases, and I'm suspicious that there are
ways that code could be unified.  For example, with list partitioning,
you have a bunch of Datums which represent the specific values that
can appear in the various partitions, and with range partitioning, you
have a bunch of Datums that represent the edges of partitions.  Well,
if you used the same array for both purposes, you could use the same
code to copy it.  That would involve flattening away a lot of the
subsidiary structure under PartitionDescData and pulling up stuff that
is buried lower down into the main structure, but I think that's
likely a good idea anyway - see also point #1.

4. I'm somewhat wondering if we ought to just legislate that the lower
bound is always inclusive and the upper bound is always exclusive.
The decision to support both inclusive and exclusive partition bounds
is responsible for an enormous truckload of complexity in this patch,
and I have a feeling it is also going to be a not-infrequent cause of
user error.

5. I wonder how well this handles multi-column partition keys.  You've
just got one Datum flag and one is-finite flag per partition, but I
wonder if you don't need to track is-finite on a per-column basis, so
that you could partition on (a, b) and have the first partition go up
to (10, 10), the second to (10, infinity), the third to (20, 10), the
fourth to (20, infinity), and the last to (infinity, infinity).  FWIW,
Oracle supports this sort of thing, so perhaps we should, too.  On a
related note, I'm not sure it's going to work to treat a composite
partition key as a record type.  The user may want to specify a
separate opfamily and collation for each column, not just inherit
whatever the record behavior is.  I'm not sure if that's what you are
doing, but the relcache structures don't seem adapted to storing one
Datum per partitioning column per partition, but rather just one Datum
per partition, and I'm not sure that's going to work very well.

Thanks for working on this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Declarative partitioning - another take

Reply via email to