Re: [HACKERS] Declarative partitioning

Amit Langote Fri, 15 Jan 2016 02:49:52 -0800

Thought a bit more on some points (see below). To anyone interested in
getting involved with the review, I'm working on getting a revised version
of the patch out soon. However, I must also mention that we need to reach
consensus on some broader design issues before any meaningful (or
fruitful) code review could be done. Please feel free to post your
thoughts about design, syntax, etc.

On 2015/12/22 10:51, Amit Langote wrote:
> On 2015/12/18 3:56, Robert Haas wrote:
>> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote wrote:
>>> Syntax to create a partitioned table (up to 2 levels of partitioning):
>>>
>>> CREATE TABLE foo (
>>> ...
>>> )
>>> PARTITION BY R/L ON (key0)
>>> SUBPARTITION BY R/L ON (key1)
>>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>]
>>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>],
>>> ...)], ...)];
>>>

[ ... ]

>>
>> Therefore, I believe it is a whole lot better to make the primary
>> syntax for table partitioning something where you issue a CREATE
>> statement for the parent and then a CREATE statement for each child.
>> If we want to also have a convenience syntax so that people who want
>> to create a parent and a bunch of children in one fell swoop can do
>> so, fine.
> 

[ ... ]

> 
> By the way, what do you think about SUBPARTITION keyword-based syntax for
> multi-level partitioning? Should we instead require that each partition
> has its own PARTITION BY in its creation command?

If we have a CREATE statement for each partition, how do we generalize
that to partitions at different levels? For example, if we use something
like the following to create a partition of parent_name:

CREATE PARTITION partition_name OF parent_name FOR VALUES ...
    WITH ... TABLESPACE ...

Do we then say:

CREATE PARTITION subpartition_name OF partition_name ...

to create a level 2 partition (sub-partition) of parent_name? Obviously,
as is readily apparent from the command, it is still a direct partition of
partition_name for all internal purposes (consider partition list caching
in relcache, recursive tuple routing, etc.) save some others.

I ask that also because it's related to the choice of syntax to use to
declare the partition key for the multi-level case. I'm considering the
SUBPARTITION BY notation and perhaps we could generalize it to more than
just 2 levels. So, for the above case, parent_name would have been created as:

CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ...

Needless to say, when subpartition_name is created with the command we saw
a moment ago, the root partitioned table would be locked. In fact, adding
a partition anywhere in the hierarchy needs an exclusive lock on the root
table. Also, partition rule (the FOR VALUES clause) would be validated
against PARTITION BY or SUBPARTITION BY clause at the respective level.

Although, I must admit I feel a little uneasy about the inherent asymmetry
in using SUBPARTITION BY for key declaration whereas piggybacking CREATE
PARTITION for creating sub-partitions. Is there a better way?

> As for the convenience syntax (if at all), how about:
> 
> CREATE TABLE foo (
>   ...
> )
> PARTITION BY ... ON (...)
> SUBPARTITION BY ... ON (...)
> opt_partition_list;
> 
> where opt_partition_list is:
> 
> PARTITIONS (
>   partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list
>   [, ...]
> )
> 
> where opt_subpart_list is:
> 
> SUBPARTITIONS (
>   subpartname FOR VALUES ... [WITH] [ TABLESPACE]
>   [, ...]
> )

Do we want this at all? It seems difficult to generalize this to
multi-level hierarchy of more than 2 levels.

>>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied
>>> directly to partitions on case-by-case basis (they are tables under the
>>> hood after all), we should restrict AT to the master table. Most of the AT
>>> changes implicitly propagate from the master table to its partitions. Some
>>> of them could be directly applied to partitions and/or sub-partitions such
>>> as rename, storage manipulations like - changing tablespace, storage
>>> parameters (reloptions), etc.:

[ ... ]

>>
>> I don't think this is a very good idea.  This is basically proposing
>> that for every DDL command that you can apply to a table, you have to
>> spell it differently for a partition.  That seems like a lot of extra
>> work for no additional functionality.

[ ... ]

>>> By the way, should we also allow changing the logging of
>>> partitions/sub-partitions as follows?
>>
>> Again, I think you're coming at this from the wrong direction.
>> Instead of saying we're going to disallow all changes to the
>> partitions and then deciding we need to allow certain changes after
>> all, I think we should allow everything that is currently allowed for
>> an inherited table and then decide which of those things we need to
>> prohibit, and why.  For example, if you insist that a child table has
>> to have a tuple descriptor that matches the parent, that can improve
>> efficiency: Append won't need to project, and so on.  But it now
>> becomes very difficult to support taking a stand-alone table and
>> making it a partition of an existing partitioned table, because the
>> set of dropped columns might not match.  Having to give an error in
>> that case amounts to "we're sorry, we can't attach your partition to
>> the partitioning hierarchy because of some invisible state that you
>> can't see" isn't very nice.  Now I'm not saying that isn't the right
>> decision, but I think the design choices here need to be carefully
>> thought about.
> 
> Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the
> very point you mention. And I can see how it may result from the
> restrictive model I propose. FWIW, other databases impose a number of
> restrictions on the partition roll-in case but not sure if for the
> internal reasons we might want to.

After thinking some more on this - I think that identical tuple
descriptors may not just be a nice-to-have but critical in some cases. For
example, consider built-in/trigger-less tuple routing. I'd imagine that
the partition to insert a tuple into would be determined just before
calling heap_insert() in ExecInsert() and CopyFrom(). That means the
HeapTuple that is passed to heap_insert() to insert into the partition
would be based on the root table's tuple descriptor. Note also that the
tuple would have passed through BR, IR triggers, constraints of the root
table. When the data is eventually queried from partitions directly, or
well even via the root table (considering existing executor capabilities),
partition's tuple descriptor at that point had better match the data that
went onto the disk. That means we had better keep at least the following
things in sync: number of attributes, name, position (attnum), type,
notnull-ness of individual attributes. So in order to do that, recursively
apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN
TYPE, SET/DROP NOT NULL on the root table to all the partitions and
prevent those sub-commands to be directly applied to any table
(partitions) in the partitioning hierarchy but the root. I further don't
see the point of allowing to set (or drop) column defaults in partitions
because now INSERT or COPY FROM cannot be directly applied to partitions.
Similar argument could be made for BR, IR triggers and CHECK constraints.
Am I missing something in all of this?

An alternative to doing any of that very well may be to design
trigger-less tuple routing to be smarter about possible mismatch of the
tuple descriptors but I haven't given that a lot of thought. Is that
really an alternative worth looking into?

Coming to the ATTACH PARTITION case, I am not quite sure how we can make
it as forgiving as ALTER TABLE INHERIT is given that we'd expect things
like trigger-less tuple routing to work out-of-the-box as described above.
That said, I would not dare propose an INCLUDING DROPPED extension to LIKE
or would it stick if I do? ;)

Unrelated to the above, I have been thinking about the planner
considerations for partition hierarchies with minimal changes considering
that we'd want to keep it that way for the first cut:

1) At least some optimizer/prep changes to handle adding partitions to
PlannerInfo's append_rel_list would be necessary.

2) To be able to use constraint exclusion, we'd need to create equivalent
CHECK constraints (non-dumpable) when partitions are created and perform
all the necessary bookkeeping. For time being.

You might wonder why (1) requires any special handling - would't
"partition hierarchies" still be "inheritance hierarchies" and so would
work with existing code albeit with unnecessary overheads like Var
translation lists? Let me think out loud about general background on this
so someone can correct me if it seems I'm going at it all wrong -

On one hand, I think to keep treating "partition hierarchies" as
"inheritance hierachies" might have some issues. I am afraid that
documented inheritance semantics may not be what we want to keep using for
the new partitioned tables. By that, I mean all the user-facing behaviors
where inheritance has some bearing. Should it also affect new partitioned
tables? Consider whether inheritance semantics would render infeasible
some of the things that we'd like to introduce for the new partitioned
tables such as automatic tuple routing, or keep us from improving planner
smarts and executor capabilities for partitioned tables over what we
already have.

OTOH, I may be looking at it wrongly. We would not be required to enforce
user-facing inheritance behaviors on the new partitioned tables after all.
That is to say - it's just that new partitioned tables could still use
relevant inheritance infrastructure behind-the-scenes for planning,
execution and a few other things and not care about abiding by regular
inheritance semantics. I should just go ahead and add special cases in all
places where existing inheritance handling code stands to cause trouble
down the line for partitioned tables. We might want to mention that we do
so somewhere in documentation and also note that regular inheritance
semantics does not apply. While it sounds probably fine as implementation
for the feature released initially, a day will inevitably come when this
behind-the-scenes implementation will be changed to something more
amenable to better optimization. But that's for future...

There's a great chance that not everyone cares right now about this part
of the new partitioning but just want to put it out there. There are more
contentious issues like the syntax, partitioning maintenance commands that
we plan to support (now or later) and such.

Any advice is greatly appreciated.

Thanks,
Amit

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Declarative partitioning

Reply via email to