On 2016/02/23 22:51, Robert Haas wrote: > On Thu, Feb 18, 2016 at 11:11 AM, Amit Langote wrote: >> Some might think that writing potentially the same PARTITION BY clause 100 >> times for 100 level-1 partitions could be cumbersome. That is what >> SUBPARTITION BY notation may be good as a shorthand for. > > I think that anybody who is doing that much partitioning is going to > use a tool to generate the DDL anyway, so it doesn't really matter. >>> I think if you've got SUBPARTITION as a keyword in the syntax >>> anywhere, you're doing it wrong. The toplevel object shouldn't really >>> care whether its children are themselves partitioned or not. >> >> This is fine for the internals. SUBPARTITION BY is mere syntax sugar. It >> might as well be just cascaded PARTITION BYs. The point is to specify as >> many of those with CREATE TABLE toplevel as the number of levels of >> partitioning we want. That does however fix the number of levels in advance. > > Which I think is not good. If we say that a table can be partitioned, > and a table that is a partition can also be partitioned, we've got a > nice general system. Fixing the number of partitioning levels for the > sake of a little syntactic sugar is, IMHO, getting our priorities > backwards.
OK. To reiterate the syntax: CREATE TABLE parent(...) PARTITION BY CREATE TABLE partition PARTITION OF parent FOR VALUES ... [ PARTITION BY ] ALTER TABLE partitioned ATTACH PARTITION name FOR VALUES ... USING TABLE source_table [ NO VALIDATE ] ALTER TABLE partitioned DETACH PARTITION name [ WITH new_name ] A note about NO VALIDATE in ATTACH PARTITION: If specified, it means user is telling the system to "trust" that none of the rows contained in the source table lie outside partition boundary specification (the FOR VALUES clause). Which is not the same thing as adding a NOT VALID check constraint because the check constraint is assumed invalid by the optimizer until explicitly validated by VALIDATE CONSTRAINT. The default behavior is to validate by scanning the source table to check for violating rows and fail adding the partition, if any. Because adding a partition requires an exclusive lock on the parent, the default behavior may cause us to have the lock for long durations which may be undesirable. Should do better than that? >> While we are on the syntax story, how about FOR VALUES syntax for range >> partitions (sorry for piggybacking it here in this message). At least some >> people dislike LESS THAN notation. Corey Huinker says we should be using >> range type literals for that. It's not clear though that using range type >> literals directly is a way ahead. For one, range type API expects there to >> exist a range type with given element type. Whereas all we require for >> range partitioning proper is for column type to have a btree operator >> class. Should we require it to have an associated range type as well? >> Don't think that there exists an API to check that either. All in all, >> range types are good to implement things in applications but not so much >> within the backend (unless I'm missing something). I know reinventing the >> wheel is disliked as well but perhaps we could offer something like the >> following because Corey offered some examples which would help from the >> flexibility: >> >> START [ EXCL ] (startval) END [ INCL ] (endval) > > I don't think using range type literals is gonna work. There's no > guarantee that the necessary range types exist. However, we could > possibly use a syntax inspired by the syntax for range types. I'm a > little nervous about asking people to type commands with mismatching > braces: > > CREATE TABLE partition_name PARTITION OF parent_name FOR VALUES [ 1, 10 ); > > ...but maybe it's a good idea. It certainly has the advantage of > being more compact than a lot of there ways we might choose to write > that. And I think LESS THAN sucks. It's just clunky and awkward > syntax. Slightly concerned about multi-column range partition key but as suggested by Corey, we can use record-like notation. CREATE TABLE foo(c1 char(2), c2 char(2)) PARTITION BY RANGE (c1, c2); CREATE TABLE foo_ax1x PARTITION OF foo FOR VALUES [ ('a', 1), ('a', 2) ); CREATE TABLE foo_ax2x PARTITION OF foo FOR VALUES [ ('a', 2), ('a', 3) ); CREATE TABLE foo_ax3x PARTITION OF foo FOR VALUES [ ('a', 3), ('a', 4) ); CREATE TABLE foo_bx1x PARTITION OF foo FOR VALUES [ ('b', 1), ('b', 2) ); CREATE TABLE foo_bx2x PARTITION OF foo FOR VALUES [ ('b', 2), ('b', 3) ); CREATE TABLE foo_bx3x PARTITION OF foo FOR VALUES [ ('b', 3), ('b', 4) ); >> I see the trade-off. I agree that considering the significance for attach >> partition case is quite important. >> >> So, the tuple routing code should be ready to use projection if there >> happens to be a partition with differing tuple descriptor. In the code I >> posted, a ResultRelInfo is lazily built afresh for each inserted tuple in >> ExecInsert's case and for each tuple where the chosen partition is >> different from the previous tuple's in CopyFrom's case. One can feel that >> there is a certain overhead to that approach for the bulk-loading case >> (almost every CopyFrom invocation). Now if projection enters this picture, >> we have to consider that too. Should we initialize ResultRelInfo's and >> corresponding ProjectionInfo's for all partitions beforehand? Consider how >> ModifyTable currently handles update on inheritance set, for example. That >> would incur unnecessary overhead if only a single tuple is inserted. But >> it would certainly help bulk-loading case. Am I missing something? > > I'd say you are missing benchmarking data. :-) > > Why should the single-tuple case be harmed here? I think setting up N ResultRelInfos in advance where the tuple would only ever require one might be superfluous. But that may point to some flaw in my original design or thinking about the case. As I said, my approach so far is to "lazily" set up a ResultRelInfo once a valid leaf partition for the tuple is found and OID returned. Now there will also be a ProjectionInfo if the target partition's physical descriptor happens to differ from the root table. All we ever start with in ExecInsert() is root table's ResultRelInfo which comes with executor state related to partition key (expression states, if any) and information about root table's partitions. ResultRelInfos for actual partitions that the tuple is inserted into are ephemeral and are created for ExecInsertIndexTuples's perusal only, that is, switched back to root table's right after inserting index tuples. Maybe, there should be an array in EState of ResultRelInfos and ProjectionInfos of size N where N is the number of leaf partitions and indexed by leaf partition sequence number. The sequence number would be based on OID order. That means the partition module now returns the sequence number instead of OID. I indeed should collect some performance numbers for any resulting implementation. >> Consider that we create partitions as inheritance children, that is, >> creating a partition using: >> >> CREATE TABLE partition PARTITION OF parent FOR VALUES ... >> >> is equivalent to saying >> >> CREATE TABLE partition (...) INHERITS (parent) >> >> except that the latter allows partition to have its own column >> definitions, table constraints and multiple parents as long as things are >> conflict-free. The former requires specifying partition bounding values. >> The common thing between the two then is StoreCatalogInheritance(partOid, >> parentOid) that will mark partition as inheritance child of parent in >> pg_inherits. So, our new partitions are really inheritance children but >> that's not apparent to users (where this last bit is important). > > So far, this sounds good to me. > >> Then consider ALTER TABLE partition - should we need to handle it in way >> different from existing inheritance code would do - >> >> * Prevent something whereas regular inheritance wouldn't? >> * Do something instead of/in addition to whatever regular inheritance does? >> >> Consider adding a column to partition - regular inheritance wouldn't >> prevent it because adding a column to child table doesn't get in the way >> of how inheritance works. Now we can make it so that if table is a >> partition created with CRATE TABLE PARTITION OF (which we can tell because >> the table has pg_partition entry), we should *actively* prevent adding a >> column. Inheritance already prevents dropping an inherited attribute in >> child but says "cannot drop inherited column". IMO in a partition's case, >> it should say "cannot drop columns of a partition". > > Agreed. So maybe pg_inherits gets an additional column that indicates > whether this is "regular" inheritance or "partitioning" inheritance, > and in the latter case there are additional restrictions. So far how I have thought it would work is that "partitions" have a pg_partition entry. So, wherever we want to check if an inheritance child is a partition, we just look up the OID in the syscache on pg_partition.partrelid. >> OTOH, adding a column to parent propagates to all partitions by way of >> ATSimpleRecursion(). But not everything that ATSimpleRecursion() does is >> necessary for partitioned tables (which we can tell from relkind or maybe >> because table has pg_partitioned_rel entry) - especially no need to >> propagate column default and check constraint, etc. Because one cannot >> apply insert/update on partitions anyway. > > Hmm. I feel like maybe that should be allowed. If the user wants to > bulk-load a bunch of data and knows that it goes in a particular > partition, why should they have to pay the overhead of determining > that anew for each row? Hm, I thought tuple routing through the parent is the (only) way to ensure that a given partition only contains rows meant for that partition. IOW, there are no CHECK constraints per partition guarding the rows going in. However, there is a provision in the latest patch to "derive" a CHECK constraint by combining the key info and bound info and maybe ExecConstraints() should check those in addition to TupleConstr.has_not_null and TupleConstr.check constraints. Thoughts? Thanks, Amit -- Sent via pgsql-hackers mailing list (firstname.lastname@example.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers