Prompted by a comment in the UPDATE/LIMIT thread, I saw Marko Tiikkaja reference Tom's post http://www.postgresql.org/message-id/1598.1399826...@sss.pgh.pa.us which mentions the possibility of a different partitioning implementation than what we have so far. As it turns out, I've been thinking about partitioning recently, so I thought I would share what I'm thinking so that others can poke holes. My intention is to try to implement this as soon as possible.
Declarative partitioning ======================== In this design, partitions are first-class objects, not normal tables in inheritance hierarchies. There are no pg_inherits entries involved at all. Partitions are a physical implementation detail. Therefore we do not allow the owner to be changed, or permissions to be granted directly to partitions; all these operations happen to the parent relation instead. System Catalogs --------------- In pg_class we have two additional relkind values: * relkind RELKIND_PARTITIONED_REL 'P' indicates a partitioned relation. It is used to indicate a parent table, i.e. one the user can directly address in DML queries. Such relations DO NOT have their own storage. These use the same rules as regular tables for access privileges, ownership and so on. * relkind RELKIND_PARTITION 'p' indicates a partition within a partitioned relation (its parent). These cannot be addressed directly in DML queries and only limited DDL support is provided. They don't have their own pg_attribute entries either and therefore they are always identical in column definitions to the parent relation. Since they are not accessible directly, there is no need for ACL considerations; the parent relation's owner is the owner, and grants are applied to the parent relation only. XXX --- is there a need for a partition having different column default values than its parent relation? Partitions are numbered sequentially, normally from 1 onwards; but it is valid to have negative partition numbers and 0. Partitions don't have names (except automatically generated ones for pg_class.relname, but they are unusable in DDL). Each partition is assigned an Expression that receives a tuple and returns boolean. This expression returns true if a given tuple belongs into it, false otherwise. If a tuple for a partitioned relation is run through expressions of all partitions, exactly one should return true. If none returns true, it might be because the partition has not been created yet. A user-facing error is raised in this case (Rationale: if user creates a partitioned rel and there is no partition that accepts some given tuple, it's the user's fault.) Additionally, each partitioned relation may have a master expression. This receives a tuple and returns an integer, which corresponds to the number of the partition it belongs into. There are two new system catalogs: pg_partitioned_rel --> (prelrelid, prelexpr) pg_partition --> (partrelid, partseq, partexpr, partoverflow) For partitioned rels that have prelexpr, we run that expression and obtain the partition number; as a crosscheck we run partexpr and ensure it returns true. For partitioned rels that don't have prelexpr, we run partexpr for each partition in turn until one returns true. This means that for a properly set up partitioned table, we need to run a single expression on a tuple to find out what partition the tuple belongs into. Per-partition expressions are formed as each partition is created, and are based on the user-supplied partitioning criterion. Master expressions are formed at relation creation time. (XXX Can we change the master expression later, as a result of some ALTER command? Presumably this would mean that all partitions might need to be rewritten.) Triggers -------- (These are user-defined triggers, not partitioning triggers. In fact there are no partitioning triggers at all.) Triggers are attached to the parent relation, not to the specific partition. When a trigger function runs on a tuple inserted, updated or modified on a partition, the data received by the trigger function makes it appear that the tuple belongs to the parent relation. There is no need to let the trigger know which partition the tuple went in or came from. XXX is there a need to give it the partition number that the tuple went it? Syntax ------ CREATE TABLE xyz ( ... ) PARTITION BY RANGE ( a_expr ) This creates the main table only: no partitions are created automatically. We do not support other types of partitioning at this stage. We will implement these later. We do not currently support ALTER TABLE/PARTITION BY (i.e. partition a table after the fact). We leave this as a future improvement. Allowed actions on RELKIND_PARTITIONED_REL: * ALTER TABLE <xyz> CREATE PARTITION <n> This creates a new partition * ALTER TABLE <xyz> CREATE PARTITION FOR <value> Same as above; the partition number is determined automatically. Allowed actions on a RELKIND_PARTITION: * ALTER PARTITION <n> ON TABLE <xyz> SET TABLESPACE * ALTER PARTITION <n> ON TABLE <xyz> DROP * CREATE INDEX .. ON PARTITION <n> ON TABLE <xyz> * VACUUM parent PARTITION <n> As a future extension we will allow partitions to become detached from the parent relation, thus becoming an independent table. This might be a relatively expensive operation: pg_attribute entries need to be created, for example. Overflow Partitions ------------------- There is no explicit concept of overflow partitions. Vacuum, aging ------------- PARTITIONED_RELs, not containing tuples directly, do not have relfrozenxid or relminmxid. Each partition has individual values for these variables. Autovacuum knows to ignore PARTITIONED_RELs, and considers each RELKIND_PARTITION separately. Each partition is vacuumed as a normal relation. Planner ------- A partitioned relation behaves just like a regular relation for purposes of planner. XXX do we need special considerations regarding relation size estimation? For scan plans, we need to prepare Append lists which are used to scan for tuples in a partitioned relation. We can setup fake constraint expressions based on the partitioning expressions, which let the planner discard unnecessary partitions by way of constraint exclusion. (In the future we might be interested in creating specialized plan and execution nodes that know more about partitioned relations, to avoid creating useless Append trees only to prune them later.) Executor -------- When doing an INSERT or UPDATE ResultRelInfo needs to be expanded for partitioned relations: the target relation of an insertion is the parent relation, but the actual partition needs to be resolved at ModifyTable execution time. This means RelOptInfo needs to know about partitions; either we deal with them as "other rels" terms, or we create a new RelOptKind. At any rate, running the partitioning expression on the new tuple would give an partition index. This needs to be done once for each new tuple. I think during ExecInsert, after running triggers and before executing constraints, we need to switch resultRelationDesc from the parent relation into the partition-specific relation. ExecInsertIndexTuples only knows about partitions. It's an error to call it using a partitioned rel. Heap Access Method ------------------ For the purposes of low-level routines in heapam.c, only partitions exist; trying to insert or modify tuples in a RELKIND_PARTITIONED_REL is an error. heap_insert and heap_multi_insert only accept inserting tuples into an individual partition. These routines do not check that the tuples belong into the specific partition; that's responsibility of higher-level code. Because of this, code like COPY will need to make its own checks. Maybe we should offer another API (in between high-level things such as ModifyTable/COPY and heapam.c) that receives tuples into a PARTITIONED_REL and routes them into specific partitions. Note: need to ensure we do not slow down COPY for the regular case of RELKIND_RELATION. Taking backups -------------- pg_dump is able to dump a partitioned relation as a CREATE TABLE/PARTITION command and a series of ALTER TABLE/CREATE PARTITION commands. The data of all partitions is considered a single COPY operation. XXX this limits the ability to restore in parallel. To fix we might consider using one COPY for each partition. It's not clear what relation should be mentioned in such a COPY command, though -- my instinct is that it should reference the parent table only, not the individual partition. Previous Discussion ------------------- http://www.postgresql.org/message-id/d3c4af540703292358s8ed731el7771ab14083aa...@mail.gmail.com Auto Partitioning Patch - WIP version 1 (Nikhil Sontakke, March 2007) http://www.postgresql.org/message-id/20080111231945.gy6...@europa.idg.com.au Declarative partitioning grammar (Gavin Sherry, January 2008) http://www.postgresql.org/message-id/bd8134a40906080702s96c90a9q3bbb581b9bd0d...@mail.gmail.com Patch for automating partitions in PostgreSQL 8.4 Beta 2 (Kedar Potdar, Jun 2009) http://www.postgresql.org/message-id/20091029111531.96cd.52131...@oss.ntt.co.jp Syntax for partitioning (Itagaki Takahiro, Oct 2009) http://www.postgresql.org/message-id/aanlktikp-1_8b04eyik0sdf8ua5kmo64o8sorfbze...@mail.gmail.com Partitioning syntax (Itagaki Takahiro, Jan 2010) Not really related: http://www.postgresql.org/message-id/1199296574.7260.149.ca...@ebony.site Dynamic Partitioning using Segment Visibility Maps (Simon Riggs, January 2008) Still To Be Designed -------------------- * Dependency issues * Are indexes/constraints inherited from the parent rel? * Multiple keys? Subpartitioning? Hash partitioning? Open Questions -------------- * What's the syntax to refer to specific partitions within a partitioned table? We could do "TABLE <xyz> PARTITION <n>", but for example if in the future we add hash partitioning, we might need some non-integer addressing (OTOH assigning sequential numbers to hash partitions doesn't seem so bad). Discussing with users of other DBMSs partitioning feature, one useful phrase is "TABLE <xyz> PARTITION FOR <value>". * Do we want to provide partitioned materialized views? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers