Approaches for Supporting EXCEPT in Partitioned Tables
------------------------------------------------------------------------
In an offline discussion with Peter Smith, Amit, and Shlok, we
identified several approaches for supporting EXCEPT with partitioned
tables and their partitions. I’d like to hear others’ opinions on
these approaches.
Consider the following partition hierarchy:
tab_root
├─ tab_part_1
│ ├─ tab_part_1_p1
│ └─ tab_part_1_p2
└─ tab_part_2
├─ tab_part_2_p1
└─ tab_part_2_p2
Approach 1:
---------------------------------
If we exclude a table, then the data in that table and all of its
partitions (i.e., the entire subtree under that table) should not be
replicated.
For example EXCEPT (tab_part_1) skips replication of tab_part_1 and
all of its partitions.
This behaviour remains the same with or without
publish_via_partition_root. The publish_via_partition_root flag only
affects publish_via_relid, i.e., the relation through which data is
published.
This approach involves certain implementation challenges. For brevity,
these are documented in the attached 'Approach1_challenges' document.
Approach 2:
---------------------------------------------------
Assign meaning to ONLY and '*' for partition tables in the EXCEPT
list. In HEAD, ONLY and '*' do not have any meaning for partitioned
tables or partitions, and these keywords are currently ignored.
Examples:
1. EXCEPT (ONLY tab_part_1) skips replication of only the table
tab_part_1. Changes for tab_root, tab_part_1_p1, and tab_part_1_p2 are
still replicated.
ii. EXCEPT (tab_part_1*) skips replication of tables tab_part_1,
tab_part_1_p1, and tab_part_1_p2
The challenges described in Approach 1, particularly around tablesync
handling and COPY behaviour, would still need to be addressed under
this approach as well. ONLY or '*' with partitioned tables is not
supported in HEAD, supporting it specifically for ALL TABLES EXCEPT
may introduce additional confusion for users.
Approach 3:
----------------
Do not allow partitions to be specified in the EXCEPT clause.
Only EXCEPT (tab_root) is supported, which excludes tab_root and all
of its partitions. Specifying EXCEPT (tab_part_1) or EXCEPT
(tab_part_1_p1) will result in an error.
~~
While Approach 1 and Approach 2 offer more flexibility to the user
compared to Approach 3, they also introduce additional design
complexity which does not seem simpler to address.
Approach 3, OTOH, provides a simpler solution and removes ambiguity
around the meaning of EXCEPT. It also aligns, to some extent, with the
behavior of ALTER PUBLICATION … DROP TABLE. For example, when a
publication is created explicitly for a ROOT partitioned table,
ALTER PUBLICATION DROP TABLE is not allowed for a partition,
regardless of the publish_via_partition_root setting. This behavior is
demonstrated in the attached AlterPubDropTable file.
That said, the behaviors of DROP TABLE and EXCEPT (table) are not
directly comparable. EXCEPT (table) is supported only for ALL TABLES
publications, whereas DROP TABLE is not supported in that context.
Similarly, when a publication is created explicitly for a partition,
ALTER PUBLICATION does allow dropping that partition.
The key point here is that 'publish_via_partition_root' does not
influence what can or cannot be dropped using ALTER PUBLICATION … DROP
TABLE. This approach follows the same principle.
OTOH when we see row-filter behaviour, publish_via_partition_root
plays a role in determining whether a row filter can be created
itself. This is demonstrated in the attached RowFilterBehaviour
experiment.
This leads us to Approach 4.
Approach 4:
----------------
Do not allow partitions to be specified in the EXCEPT clause when
publish_via_partition_root is true, but allow them when
publish_via_partition_root is false.
When publish_via_partition_root = true:
EXCEPT (tab_part_1_p1) results in an error.
EXCEPT (tab_part_1) results in an error.
EXCEPT (tab_root) excludes the entire hierarchy under the root table.
When publish_via_partition_root = false:
EXCEPT (tab_part_1_p1) skips replication of that partition.
EXCEPT (tab_part_1) skips replication of that partition and all of its
child partitions.
EXCEPT (tab_root) skips replication of the partitioned table and all
partitions beneath it.
This design is guided by the way row filters behave. It offers more
flexibility than Approach 3 while avoiding the complexity of
Approaches 1 and 2. However, it may cause some user confusion since
EXCEPT behavior depends on publish_via_partition_root.
Given this, the current plan is to implement Approach 3 as the initial
patch, and then implement the changes required for Approach 4 in a
top-up patch. This will allow us to evaluate if at all Approach 4
causes any implementation complexities.
Please share your thoughts on the approaches described.
thanks
Shveta
Consider the following complex partition hierarchy:
Plain Text
tab_root
├── tab_part_1
│ ├── tab_part_1_1
│ │ ├── tab_part_1_1_1
│ │ │ └── tab_part_1_1_1_1
│ │ └── tab_part_1_1_2
│ └── tab_part_1_2
│ ├── tab_part_1_2_1
│ └── tab_part_1_2_2
└── tab_part_2
Behavior on HEAD:
Assume a publication is created on the root table with
publish_via_partition_root = true:
CREATE PUBLICATION pub1 FOR TABLE tab_root WITH (publish_via_partition_root =
true);
In this case:
pg_publication_rel contains an entry only for tab_root.
pg_publication_tables also contains an entry only for tab_root.
When a subscription is created:
CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 dbname=postgres'
PUBLICATION pub1;
pg_subscription_rel is populated using pg_get_publication_tables, so it also
contains only tab_root .
For each entry in pg_subscription_rel, a tablesync worker is launched.
For each tablesync worker, COPY command is constructed based on the
corresponding info in pg_subscription_rel. For this case following copy command
is constructed.
COPY tab_root TO STDOUT;
The data is then copied to the subscriber using COPY FROM.
-------------------------------------------------------------------------------------
Now, consider the behavior with the patch.
Suppose we create the following publication:
CREATE PUBLICATION pub1 FOR ALL TABLES EXCEPT (tab_part_1_1) WITH
(publish_via_partition_root = true);
Following Approach 1, we cannot have only a single entry for tab_root in
pg_publication_tables, because tab_part_1_1 is explicitly excluded and its and
its partitions' data must not be replicated.
Instead, pg_publication_tables should contain entries corresponding to the
remaining relevant tables, such as:
tab_part_1_2
tab_part_2
(The exact set of tables still needs to be finalized.)
Next, pg_subscription_rel needs to be populated in a way that allows tablesync
to:
1. Construct COPY commands only for tab_part_1_2 and tab_part_2 on the publisher
2. Apply the copied data into the root table tab_root on the subscriber, since
publish_via_partition_root is enabled
I think the main task for this scenario is to decide the what should be the
content of pg_publication_tables.
And what should be populated in pg_subscription_rel to construct appropriate
COPY query in tablesync worker.
Row filter behaviour on HEAD
========================
OBSERVATION:
When publish_via_partition_root is true, a row filter can be created for both
the partitioned table and its partitions. The filter defined on the partitioned
table is applied to all partitions, while a filter defined on a specific
partition applies only to that partition.
When publish_via_partition_root is false, creating a publication with a row
filter on the partitioned table itself fails. Row filters can still be created
for individual partitions, but they are applied only to the corresponding
partitions.
TEST for better understanding
---------------
CREATE TABLE parent (id int primary key, val int) PARTITION BY RANGE (id);
CREATE TABLE child1 PARTITION OF parent FOR VALUES FROM (0) TO (100);
CREATE TABLE child2 PARTITION OF parent FOR VALUES FROM (100) TO (200);
<publish_via_partition_root =true>
Test A: Root-level filter
--------------------
CREATE PUBLICATION pub_root FOR TABLE parent WHERE (val > 10) WITH
(publish_via_partition_root = true);
CREATE SUBSCRIPTION sub1 connection 'dbname=postgres host=localhost user=shveta
port=5433' publication pub_root;
INSERT INTO parent VALUES (50, 5); -- not published
INSERT INTO parent VALUES (60, 20); -- published
INSERT INTO child1 VALUES (70, 5); -- not published
INSERT INTO child1 VALUES (80, 20); -- published
INSERT INTO child2 VALUES (150, 5); -- not published
INSERT INTO child2 VALUES (160, 20); -- published
Conclusion: When publish_via_partition_root=true, the row filter defined on the
root table is applied to all partitions.
Test B: Partition-level filter
--------------------
drop publication pub_root;
drop subscription sub1;
truncate parent;
CREATE PUBLICATION pub_part FOR TABLE child1 WHERE (val > 10) WITH
(publish_via_partition_root = true);
create subscription sub2 connection 'dbname=postgres host=localhost user=shveta
port=5433' publication pub_part;
INSERT INTO parent VALUES (50, 5); -- not published
INSERT INTO parent VALUES (60, 20); -- published
INSERT INTO child1 VALUES (70, 5); -- not published
INSERT INTO child1 VALUES (80, 20); -- published
INSERT INTO child2 VALUES (150, 5); -- not published (as publication is only
for child1)
INSERT INTO child2 VALUES (160, 20); -- not published (as publication is only
for child1)
Conclusion: When publish_via_partition_root=true, the row filter defined on the
partition is applied to that parition.
------------
Now test above publish_via_partition_root as false
------------
drop publication pub_part;
drop subscription sub2;
truncate parent;
Test A: Root-level filter
--------------------
CREATE PUBLICATION pub_root FOR TABLE parent WHERE (val > 10) WITH
(publish_via_partition_root = false);
create subscription sub1 connection 'dbname=postgres host=localhost user=shveta
port=5433' publication pub_root;
postgres=# CREATE PUBLICATION pub_root FOR TABLE parent WHERE (val > 10) WITH
(publish_via_partition_root = false);
ERROR: cannot use publication WHERE clause for relation "parent"
DETAIL: WHERE clause cannot be used for a partitioned table when
publish_via_partition_root is false.
Test B: Partition-level filter
-----------------------
CREATE PUBLICATION pub_part FOR TABLE child1 WHERE (val > 10) WITH
(publish_via_partition_root = false);
create subscription sub2 connection 'dbname=postgres host=localhost user=shveta
port=5433' publication pub_part;
INSERT INTO parent VALUES (70, 5); -- not published
INSERT INTO parent VALUES (80, 20); -- published
INSERT INTO child1 VALUES (85, 5); -- not published
INSERT INTO child1 VALUES (90, 20); -- published
INSERT INTO parent VALUES (180, 20); -- not published (as publication is only
for child1)
Conclusion: When publish_via_partition_root=false, the row filter defined on
the partition is applied to that parition.
tab_root (RANGE range_col)
├── tab_part_1 (1–1000) PARTITION BY RANGE (i)
│ ├── tab_part_1_p1 (i 0–500)
│ └── tab_part_1_p2 (i 500–1000)
└── tab_part_2 (1000–2000) PARTITION BY RANGE (i)
├── tab_part_2_p1 (i 0–500)
└── tab_part_2_p2 (i 500–1000)
CREATE TABLE tab_root (range_col int, i int, j int) PARTITION BY RANGE
(range_col);
CREATE TABLE tab_part_1 PARTITION OF tab_root FOR VALUES FROM (1) TO (1000)
PARTITION BY RANGE (i);
CREATE TABLE tab_part_2 PARTITION OF tab_root FOR VALUES FROM (1000) TO (2000)
PARTITION BY RANGE (i);
-- Leaf partitions under tab_part_1
CREATE TABLE tab_part_1_p1 PARTITION OF tab_part_1 FOR VALUES FROM (0) TO (500);
CREATE TABLE tab_part_1_p2 PARTITION OF tab_part_1 FOR VALUES FROM (500) TO
(1000);
-- Leaf partitions under tab_part_2
CREATE TABLE tab_part_2_p1 PARTITION OF tab_part_2 FOR VALUES FROM (0) TO (500);
CREATE TABLE tab_part_2_p2 PARTITION OF tab_part_2 FOR VALUES FROM (500) TO
(1000);
CREATE PUBLICATION pub1 for table tab_root;
postgres=# SELECT c.relname, p.pubname FROM pg_publication_rel pr JOIN pg_class
c ON c.oid = pr.prrelid JOIN pg_publication p ON p.oid = pr.prpubid;
relname | pubname
----------+---------
tab_root | pub1
(1 row)
postgres=# select * from pg_publication_tables;
pubname | schemaname | tablename | attnames | rowfilter
---------+------------+---------------+-----------------+-----------
pub1 | public | tab_part_1_p1 | {range_col,i,j} |
pub1 | public | tab_part_1_p2 | {range_col,i,j} |
pub1 | public | tab_part_2_p1 | {range_col,i,j} |
pub1 | public | tab_part_2_p2 | {range_col,i,j} |
(4 rows)
postgres=# \dRp+
Publication pub1
Owner | All tables | All sequences | Inserts | Updates | Deletes | Truncates
| Generated columns | Via root
--------+------------+---------------+---------+---------+---------+-----------+-------------------+----------
shveta | f | f | t | t | t | t
| none | f
Tables:
"public.tab_root"
--The partitions which were not part of pg_publication_rel, it did not consider
it as included in publication
postgres=# alter publication pub1 drop table only tab_part_2;
ERROR: relation "tab_part_2" is not part of the publication
postgres=# alter publication pub1 drop table only tab_part_1_p1;
ERROR: relation "tab_part_1_p1" is not part of the publication
--It only let tab_root dropped. Even if I gave ONLY, all parititons were also
dropped to be published.
postgres=# alter publication pub1 drop table only tab_root;
ALTER PUBLICATION
postgres=# SELECT c.relname, p.pubname FROM pg_publication_rel pr JOIN pg_class
c ON c.oid = pr.prrelid JOIN pg_publication p ON p.oid = pr.prpubid;
relname | pubname
---------+---------
(0 rows)
postgres=# select * from pg_publication_tables;
pubname | schemaname | tablename | attnames | rowfilter
---------+------------+-----------+----------+-----------
(0 rows)
=================================
drop publication pub1;
--Now try with PUBLISH_VIA_PARTITION_ROOT as true:
create publication pub1 for table tab_root with (
PUBLISH_VIA_PARTITION_ROOT=true);
postgres=# \dRp+
Publication pub1
Owner | All tables | All sequences | Inserts | Updates | Deletes | Truncates
| Generated columns | Via root
--------+------------+---------------+---------+---------+---------+-----------+-------------------+----------
shveta | f | f | t | t | t | t
| none | t
Tables:
"public.tab_root"
postgres=# SELECT c.relname, p.pubname FROM pg_publication_rel pr JOIN pg_class
c ON c.oid = pr.prrelid JOIN pg_publication p ON p.oid = pr.prpubid;
relname | pubname
----------+---------
tab_root | pub1
(1 row)
--This output is different as compared to PUBLISH_VIA_PARTITION_ROOT=false case.
postgres=# select * from pg_publication_tables;
pubname | schemaname | tablename | attnames | rowfilter
---------+------------+-----------+-----------------+-----------
pub1 | public | tab_root | {range_col,i,j} |
--Rest of the output is same as that of PUBLISH_VIA_PARTITION_ROOT=false:
postgres=# alter publication pub1 drop table only tab_part_2;
ERROR: relation "tab_part_2" is not part of the publication
postgres=# alter publication pub1 drop table only tab_part_1_p1;
ERROR: relation "tab_part_1_p1" is not part of the publication
postgres=# alter publication pub1 drop table only tab_part_1;
ERROR: relation "tab_part_1" is not part of the publication
postgres=# alter publication pub1 drop table only tab_root;
ALTER PUBLICATION
postgres=# SELECT c.relname, p.pubname FROM pg_publication_rel pr JOIN pg_class
c ON c.oid = pr.prrelid JOIN pg_publication p ON p.oid = pr.prpubid;
relname | pubname
---------+---------
(0 rows)
postgres=# select * from pg_publication_tables;
pubname | schemaname | tablename | attnames | rowfilter
---------+------------+-----------+----------+-----------
(0 rows)