Thought a bit more on some points (see below). To anyone interested in getting involved with the review, I'm working on getting a revised version of the patch out soon. However, I must also mention that we need to reach consensus on some broader design issues before any meaningful (or fruitful) code review could be done. Please feel free to post your thoughts about design, syntax, etc.
On 2015/12/22 10:51, Amit Langote wrote: > On 2015/12/18 3:56, Robert Haas wrote: >> On Mon, Dec 14, 2015 at 2:14 AM, Amit Langote wrote: >>> Syntax to create a partitioned table (up to 2 levels of partitioning): >>> >>> CREATE TABLE foo ( >>> ... >>> ) >>> PARTITION BY R/L ON (key0) >>> SUBPARTITION BY R/L ON (key1) >>> [(PARTITION foo_1 FOR VALUES <val> [<storage_params>] [<tblspc>] >>> [(SUBPARTITION foo_1_1 FOR VALUES <val> [<storage_params>] [<tblspc>], >>> ...)], ...)]; >>> [ ... ] >> >> Therefore, I believe it is a whole lot better to make the primary >> syntax for table partitioning something where you issue a CREATE >> statement for the parent and then a CREATE statement for each child. >> If we want to also have a convenience syntax so that people who want >> to create a parent and a bunch of children in one fell swoop can do >> so, fine. > [ ... ] > > By the way, what do you think about SUBPARTITION keyword-based syntax for > multi-level partitioning? Should we instead require that each partition > has its own PARTITION BY in its creation command? If we have a CREATE statement for each partition, how do we generalize that to partitions at different levels? For example, if we use something like the following to create a partition of parent_name: CREATE PARTITION partition_name OF parent_name FOR VALUES ... WITH ... TABLESPACE ... Do we then say: CREATE PARTITION subpartition_name OF partition_name ... to create a level 2 partition (sub-partition) of parent_name? Obviously, as is readily apparent from the command, it is still a direct partition of partition_name for all internal purposes (consider partition list caching in relcache, recursive tuple routing, etc.) save some others. I ask that also because it's related to the choice of syntax to use to declare the partition key for the multi-level case. I'm considering the SUBPARTITION BY notation and perhaps we could generalize it to more than just 2 levels. So, for the above case, parent_name would have been created as: CREATE TABLE parent_name PARTITION BY ... SUBPARTITION BY ... Needless to say, when subpartition_name is created with the command we saw a moment ago, the root partitioned table would be locked. In fact, adding a partition anywhere in the hierarchy needs an exclusive lock on the root table. Also, partition rule (the FOR VALUES clause) would be validated against PARTITION BY or SUBPARTITION BY clause at the respective level. Although, I must admit I feel a little uneasy about the inherent asymmetry in using SUBPARTITION BY for key declaration whereas piggybacking CREATE PARTITION for creating sub-partitions. Is there a better way? > As for the convenience syntax (if at all), how about: > > CREATE TABLE foo ( > ... > ) > PARTITION BY ... ON (...) > SUBPARTITION BY ... ON (...) > opt_partition_list; > > where opt_partition_list is: > > PARTITIONS ( > partname FOR VALUES ... [WITH] [TABLESPACE] opt_subpart_list > [, ...] > ) > > where opt_subpart_list is: > > SUBPARTITIONS ( > subpartname FOR VALUES ... [WITH] [ TABLESPACE] > [, ...] > ) Do we want this at all? It seems difficult to generalize this to multi-level hierarchy of more than 2 levels. >>> What about ALTER TABLE? - Instead of allowing ALTER TABLE to be applied >>> directly to partitions on case-by-case basis (they are tables under the >>> hood after all), we should restrict AT to the master table. Most of the AT >>> changes implicitly propagate from the master table to its partitions. Some >>> of them could be directly applied to partitions and/or sub-partitions such >>> as rename, storage manipulations like - changing tablespace, storage >>> parameters (reloptions), etc.: [ ... ] >> >> I don't think this is a very good idea. This is basically proposing >> that for every DDL command that you can apply to a table, you have to >> spell it differently for a partition. That seems like a lot of extra >> work for no additional functionality. [ ... ] >>> By the way, should we also allow changing the logging of >>> partitions/sub-partitions as follows? >> >> Again, I think you're coming at this from the wrong direction. >> Instead of saying we're going to disallow all changes to the >> partitions and then deciding we need to allow certain changes after >> all, I think we should allow everything that is currently allowed for >> an inherited table and then decide which of those things we need to >> prohibit, and why. For example, if you insist that a child table has >> to have a tuple descriptor that matches the parent, that can improve >> efficiency: Append won't need to project, and so on. But it now >> becomes very difficult to support taking a stand-alone table and >> making it a partition of an existing partitioned table, because the >> set of dropped columns might not match. Having to give an error in >> that case amounts to "we're sorry, we can't attach your partition to >> the partitioning hierarchy because of some invisible state that you >> can't see" isn't very nice. Now I'm not saying that isn't the right >> decision, but I think the design choices here need to be carefully >> thought about. > > Yeah, I am concerned about the ATTACH PARTITION USING TABLE case for the > very point you mention. And I can see how it may result from the > restrictive model I propose. FWIW, other databases impose a number of > restrictions on the partition roll-in case but not sure if for the > internal reasons we might want to. After thinking some more on this - I think that identical tuple descriptors may not just be a nice-to-have but critical in some cases. For example, consider built-in/trigger-less tuple routing. I'd imagine that the partition to insert a tuple into would be determined just before calling heap_insert() in ExecInsert() and CopyFrom(). That means the HeapTuple that is passed to heap_insert() to insert into the partition would be based on the root table's tuple descriptor. Note also that the tuple would have passed through BR, IR triggers, constraints of the root table. When the data is eventually queried from partitions directly, or well even via the root table (considering existing executor capabilities), partition's tuple descriptor at that point had better match the data that went onto the disk. That means we had better keep at least the following things in sync: number of attributes, name, position (attnum), type, notnull-ness of individual attributes. So in order to do that, recursively apply ADD/DROP COLUMN, SET WITH/WITHOUT OIDS, RENAME COLUMN, ALTER COLUMN TYPE, SET/DROP NOT NULL on the root table to all the partitions and prevent those sub-commands to be directly applied to any table (partitions) in the partitioning hierarchy but the root. I further don't see the point of allowing to set (or drop) column defaults in partitions because now INSERT or COPY FROM cannot be directly applied to partitions. Similar argument could be made for BR, IR triggers and CHECK constraints. Am I missing something in all of this? An alternative to doing any of that very well may be to design trigger-less tuple routing to be smarter about possible mismatch of the tuple descriptors but I haven't given that a lot of thought. Is that really an alternative worth looking into? Coming to the ATTACH PARTITION case, I am not quite sure how we can make it as forgiving as ALTER TABLE INHERIT is given that we'd expect things like trigger-less tuple routing to work out-of-the-box as described above. That said, I would not dare propose an INCLUDING DROPPED extension to LIKE or would it stick if I do? ;) Unrelated to the above, I have been thinking about the planner considerations for partition hierarchies with minimal changes considering that we'd want to keep it that way for the first cut: 1) At least some optimizer/prep changes to handle adding partitions to PlannerInfo's append_rel_list would be necessary. 2) To be able to use constraint exclusion, we'd need to create equivalent CHECK constraints (non-dumpable) when partitions are created and perform all the necessary bookkeeping. For time being. You might wonder why (1) requires any special handling - would't "partition hierarchies" still be "inheritance hierarchies" and so would work with existing code albeit with unnecessary overheads like Var translation lists? Let me think out loud about general background on this so someone can correct me if it seems I'm going at it all wrong - On one hand, I think to keep treating "partition hierarchies" as "inheritance hierachies" might have some issues. I am afraid that documented inheritance semantics may not be what we want to keep using for the new partitioned tables. By that, I mean all the user-facing behaviors where inheritance has some bearing. Should it also affect new partitioned tables? Consider whether inheritance semantics would render infeasible some of the things that we'd like to introduce for the new partitioned tables such as automatic tuple routing, or keep us from improving planner smarts and executor capabilities for partitioned tables over what we already have. OTOH, I may be looking at it wrongly. We would not be required to enforce user-facing inheritance behaviors on the new partitioned tables after all. That is to say - it's just that new partitioned tables could still use relevant inheritance infrastructure behind-the-scenes for planning, execution and a few other things and not care about abiding by regular inheritance semantics. I should just go ahead and add special cases in all places where existing inheritance handling code stands to cause trouble down the line for partitioned tables. We might want to mention that we do so somewhere in documentation and also note that regular inheritance semantics does not apply. While it sounds probably fine as implementation for the feature released initially, a day will inevitably come when this behind-the-scenes implementation will be changed to something more amenable to better optimization. But that's for future... There's a great chance that not everyone cares right now about this part of the new partitioning but just want to put it out there. There are more contentious issues like the syntax, partitioning maintenance commands that we plan to support (now or later) and such. Any advice is greatly appreciated. Thanks, Amit -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers