On 18 August 2015 at 11:30, Amit Langote <langote_amit...@lab.ntt.co.jp> wrote:
> > Hi, > > I would like propose $SUBJECT for this development cycle. Attached is a > WIP patch that implements most if not all of what's described below. Some > yet unaddressed parts are mentioned below, too. I'll add this to the > CF-SEP. > > Syntax > ====== > > 1. Creating a partitioned table > > CREATE TABLE table_name > PARTITION BY {RANGE|LIST} > ON (column_list); > > Where column_list consists of simple column names or expressions: > > PARTITION BY LIST ON (name) > PARTITION BY RANGE ON (year, month) > > PARTITION BY LIST ON ((lower(left(name, 2))) > PARTITION BY RANGE ON ((extract(year from d)), (extract(month from d))) > > Note: LIST partition key supports only one column. > > For each column, you could write operator class name: > > PARTITION BY LIST/RANGE ON (colname [USING] opclass_name), > > If not specified, the default btree operator class based on type of each > key column is used. If none of the available btree operator classes are > compatible with the partitioning strategy (list/range), error is thrown. > Built-in btree operator classes cover a good number of types for list and > range partitioning in practical scenarios. > > A table created using this form is of proposed new relkind > RELKIND_PARTITIONED_REL. An entry in pg_partitioned_rel (see below) is > created to store partition key info. > > Note: A table cannot be partitioned after-the-fact using ALTER TABLE. > > Normal dependencies are created between the partitioned table and operator > classes, object in partition expressions like functions. > > 2. Creating a partition of a partitioned table > > CREATE TABLE table_name > PARTITION OF partitioned_table_name > FOR VALUES values_spec; > > Where values_spec is: > > listvalues: [IN] (val1, ...) > > rangevalues: START (col1min, ... ) END (col1max, ... ) > | START (col1min, ... ) > | END (col1max, ... ) > > A table created using this form has proposed pg_class.relispartition set > to true. An entry in pg_partition (see below) is created to store the > partition bound info. > > The values_spec should match the partitioning strategy of the partitioned > table. In case of a range partition, the values in START and/or END should > match columns in the partition key. > > Defining a list partition is fairly straightforward - just spell out the > list of comma-separated values. Error is thrown if the list of values > overlaps with one of the existing partitions' list. > > CREATE TABLE persons_by_state (name text, state text) > PARTITION BY LIST ON (state); > > CREATE TABLE persons_IL > PARTITION OF persons_by_state > FOR VALUES IN ('IL'); > > CREATE TABLE persons_fail > PARTITION OF persons_by_state > FOR VALUES IN ('IL'); > ERROR: cannot create partition that overlaps with an existing one > > For a range partition, there are more than one way: > > Specify both START and END bounds: resulting range should not overlap with > the range(s) covered by existing partitions. Error is thrown otherwise. > Although rare in practice, gaps between ranges are OK. > > CREATE TABLE measurement(logdate date NOT NULL) > PARTITION BY RANGE ON (logdate); > > CREATE TABLE measurement_y2006m02 > PARTITION OF measurement > FOR VALUES START ('2006-02-01') END ('2006-03-01'); --success > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-15') END ('2006-03-01'); > ERROR: cannot create partition that overlaps with an existing one > > Specify only the START bound: add the partition on the left of some range > covered by existing partitions provided no overlap occurs (also > considering gaps between ranges, if any). If no such range exists, the new > partition will cover the range [START, +INFINITY) and become the rightmost > partition. Error is thrown if the specified START causes overlap. > > CREATE TABLE measurement_y2006m01 > PARTITION OF measurement > FOR VALUES START ('2006-01-01'); --success > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES START ('2006-02-01'); --overlaps with measurement_y2006m02 > ERROR: cannot create partition that overlaps with an existing one > > Specify only the END bound: add the partition on the right of some range > covered by existing partitions provided no overlap occurs (also > considering gaps between ranges, if any). If no such range exists, the new > partition would cover the range (-INFINITY, END) and become the leftmost > partition. Error is thrown if the specified END causes overlap. > > CREATE TABLE measurement_y2006m03 > PARTITION OF measurement > FOR VALUES END ('2006-04-01'); --success > > CREATE TABLE measurement_fail > PARTITION OF measurement > FOR VALUES END ('2006-03-01'); --overlaps with measurement_y2006m02 > ERROR: cannot create partition that overlaps with an existing one > > For each partition, START and END bound values are stored in the > catalog. Note that the lower bound is inclusive, whereas the upper bound > is exclusive. > > Note: At most one range partition can have null min bound in which case it > covers the range (-INFINITY, END). Also, at most one range partition can > have null max bound in which case it covers the range [START, +INFINITY). > > A normal dependency is created between the parent and the partition. > > 3. Multi-level partitioning > > CREATE TABLE table_name > PARTITION OF partitioned_table_name > FOR VALUES values_spec > PARTITION BY {RANGE|LIST} ON (columns_list) > > This variant implements a form of so called composite or sub-partitioning > with arbitrarily deep partitioning structure. A table created using this > form has both the relkind RELKIND_PARTITIONED_REL and > pg_class.relispartition set to true. > > 4. (yet unimplemented) Attach partition (from existing table) > > ALTER TABLE partitioned_table > ATTACH PARTITION partition_name > FOR VALUES values_spec > USING [TABLE] table_name; > > ALTER TABLE table_name > SET VALID PARTITION OF <parent>; > > The first of the above pair of commands would attach table_name as a (yet) > 'invalid' partition of partitioned_table (after confirming that it matches > the schema and does not overlap with other partitions per FOR VALUES > spec). It would also record the FOR VALUES part in the partition catalog > and set pg_class.relispartition to true for table_name. > > After the first command is done, the second command would take exclusive > lock on table_name, scan the table to check if it contains any values > outside the boundaries defined by FOR VALUES clause defined previously, > throw error if so, mark as valid partition of parent if not. > > Does that make sense? > > 5. Detach partition > > ALTER TABLE partitioned_table > DETACH PARTITION partition_name [USING table_name] > > This removes partition_name as partition of partitioned_table. The table > continues to exist with the same name or 'table_name', if specified. > pg_class.relispartition is set to false for the table, so it behaves like > a normal table. > > > System catalogs > =============== > > 1. pg_partitioned_rel > > CATALOG(pg_partitioned_rel,oid) BKI_WITHOUT_OIDS > { > Oid partrelid; /* partitioned table pg_class.oid */ > char partstrategy; /* partitioning strategy 'l'/'r' */ > int16 partnatts; /* number of partition columns */ > int2vector partkey; /* column numbers of partition columns; > * 0 where specified column is an > * expresion */ > oidvector partclass; /* operator class to compare keys */ > pg_node_tree partexprs; /* expression trees for partition key > * members that are not simple column > * references; one for each zero entry > * in partkey[] */ > }; > > > 2. pg_partition (omits partisvalid alluded to above) > > CATALOG(pg_partition,oid) BKI_WITHOUT_OIDS > { > Oid partitionid; /* partition oid */ > Oid partparent; /* parent oid */ > anyarray partlistvalues; /* list of allowed values of the only > * partition column */ > anyarray partrangebounds; /* list of bounds of ranges of > * allowed values per partition key > * column */ > }; > > > Further notes > ============= > > There are a number of restrictions on performing after-the-fact changes > using ALTER TABLE to partitions (ie, relispartition=true): > > * Cannot add/drop column > * Cannot set/drop OIDs > * Cannot set/drop NOT NULL > * Cannot set/drop default > * Cannot alter column type > * Cannot add/drop alter constraint (table level) > * Cannot change persistence > * Cannot change inheritance > * Cannot link to a composite type > > Such changes should be made to the topmost parent in the partitioning > hierarchy (hereafter referred to as just parent). These are recursively > applied to all the tables in the hierarchy. Although the last two items > cannot be performed on parent either. > > Dropping a partition using DROP TABLE is not allowed. It needs to detached > using ALTER TABLE on parent before it can be dropped as a normal table. > > Triggers on partitions are not allowed. They should be defined on the > parent. That said, I could not figure out a way to implement row-level > AFTER triggers on partitioned tables (more about that in a moment); so > they are currently not allowed: > > CREATE TRIGGER audit_trig > AFTER INSERT ON persons > FOR EACH ROW EXECUTE PROCEDURE audit_func(); > ERROR: Row-level AFTER triggers are not supported on partitioned tables > > Column/table constraints on partitions are not allowed. They should be > defined on the parent. Foreign key constraints are not allowed due to > above limitation (no row-level after triggers). > > A partitioning parent (RELKIND_PARTITIONED_REL) does not have storage. > > Creating index on parent is not allowed. They should be defined on (leaf) > partitions. Because of this limitation, primary keys are not allowed on a > partitioned table. Perhaps, we should be able to just create a dummy > entry somewhere to represent an index on parent (which every partition > then copies.) Then by restricting primary key to contain all partition key > columns, we can implement unique constraint over the whole partitioned > table. That will in turn allow us to use partitioned tables as PK rels in > a foreign key constraint provided row-level AFTER trigger issue is > resolved. > > VACUUM/ANALYZE on individual partitions should work like normal tables. > I've not implemented something like inheritance tree sampling for > partitioning tree in this patch. Autovacuum has been taught to ignore > parent tables and vacuum/analyze partitions normally. > > Dropping a partitioned table should (?) unconditionally drop all its > partitions probably but, currently the patch uses dependencies, so > requires to specify CASCADE to do the same. > > What should TRUNCATE on partitioned table do? > > Ownership, privileges/permissions, RLS should be managed through the > parent table although not comprehensively addressed in the patch. > > There is no need to define tuple routing triggers. CopyFrom() and > ExecInsert() determine target partition just before performing > heap_insert() and ExecInsertIndexTuples(). IOW, any BR triggers and > constraints (on parent) are executed for tuple before being routed to a > partition. If no partition can be found, it's an error. > > Because row-level AFTER triggers need to save ItemPointers in trigger > event data and defining triggers on partitions (which is where tuples > really go) is not allowed, I could not find a straightforward way to > implement them. So, perhaps we should allow (only) row-level AFTER > triggers on partitions or think of modifying trigger.c to know about this > twist explicitly. > > Internal representations > ======================== > > For a RELKIND_PARTITIONED_REL relations, RelationData gets a few new > fields to store the partitioning metadata. That includes partition key > tuple (pg_partitioned_rel) including some derived info (opfamily, > opcintype, compare proc FmgrInfos, partition expression trees). > > Additionally, it also includes a PartitionInfo object which includes > partition OIDs array, partition bound arrays (if range partitioned, > rangemax is sorted in ascending order and OIDs are likewise ordered). It > is built from information in pg_partition catalog. > > While RelationBuildDesc() initializes the basic key info, fields like > expression trees, PartitionInfo are built on demand and cached. For > example, InitResultRelInfo() builds the latter to populate the newly added > ri_PartitionKeyInfo and ri_Partitions fields, respectively. > > PartitionInfo object is rebuilt on every cache invalidation of the rel > which includes when adding/attaching/detaching a new partition. > > Planner and executor considerations > ===================================== > > The patch does not yet implement any planner changes for partitioned > tables, although I'm working on the same and post updates as soon as > possible. That means, it is not possible to run SELECT/UPDATE/DELETE > queries on partitioned tables without getting: > > postgres=# SELECT * FROM persons; > ERROR: could not open file "base/13244/106975": No such file or directory > > Given that there would be more direct ways of performing partition pruning > decisions with the proposed, it would be nice to utilize them. > Specifically, I would like to avoid having to rely on constraint exclusion > for partition pruning whereby subquery_planner() builds append_rel_list > and the later steps exclude useless partitions. > > By extending RelOptInfo to include partitioning info for partitioned rels, > it might be possible to perform partition pruning directly without > previously having to expand them. Although, as things stand now, it's not > clear how that might work - when would partition RTEs be added to the > rtable? The rtable is assumed not to change after > setup_simple_rel_arrays() has done its job which is much earlier than when > it would be desirable for the partitioned table expansion (along with > partition pruning) to happen. Moreover, if that means we might not be able > to build RelOptInfo's for partitions, how to choose best paths for them > (index paths or not, etc.)? > > I'm also hoping we don't require something like inheritance_planner() for > when partitioned tables are target rels. I assume considerations for why > the special processing is necessary for inheritance trees in that scenario > don't apply to partitioning trees. So, if grouping_planner() returns a > Append plan (among other options) for the partitioning tree, tacking a > ModifyTable node on top should do the trick? > > Suggestions greatly welcome in this area. > > Other items > =========== > > Will include the following once we start reaching consensus on main parts > of the proposed design/implementation: > > * New regression tests > * Documentation updates > * pg_dump, psql, etc. > > For reference, some immediately previous discussions: > > * On partitioning * > > http://www.postgresql.org/message-id/20140829155607.gf7...@eldon.alvh.no-ip.org > > * Partitioning WIP patch * > http://www.postgresql.org/message-id/54ec32b6.9070...@lab.ntt.co.jp > > Comments welcome! > > Thanks, > Amit > Wow, didn't expect to see that email this morning. A very quick test: CREATE TABLE purchases (purchase_id serial, purchase_time timestamp, item text) partition by range on ((extract(year from purchase_time)),(extract(month from purchase_time))); ERROR: referenced relation "purchases" is not a table or foreign table Thom