Hi Craig,

Thanks for your explanation. I have to agree with your argument that, in the general case, replicating DDL statements using logical decoding is problematic. But we are mostly considering logical decoding in a rather limited context: replication between identical Postgres database nodes (multimaster).

Do you think that in this case replication of DDL can be done as a sequence of low-level operations on the system catalog tables, including manipulation of locks? So in your example with the ALTER TABLE statement, can we correctly replicate it to other nodes as a request to take an exclusive lock plus some manipulations of the catalog tables and the data table itself?
If so, instead of full support for DDL in logical decoding we would only need to:

1. Add an option controlling whether operations on system catalog tables are included in logical replication.
2. Make it possible to replicate lock requests (this can be useful not only for DDL).

I looked at how DDL is implemented in BDR and did it in a similar way in our multimaster. But it is awful: we need two different channels for propagating changes. Additionally, in multimaster we want to enforce cluster-wide ACID, which certainly includes operations on metadata. That will be very difficult to implement if replication of DML and DDL is done in two different ways...

Let me ask one more question concerning logical replication: how difficult would it be, from your point of view, to support two-phase commit in logical replication? Are there fundamental problems?
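To be concrete, I mean decoding transactions that use the existing two-phase commit commands, along these lines (table and gid names are just for illustration):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    PREPARE TRANSACTION 'mm-tx-1';
    -- the coordinator would need to see the prepared transaction in the
    -- logical decoding stream before all nodes vote, and only then:
    COMMIT PREPARED 'mm-tx-1';

The interesting part is whether the decoding plugin can be handed the transaction at PREPARE time rather than only after the final commit.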

Thanks in advance,
Konstantin




On 17.02.2016 04:33, Craig Ringer wrote:
On 17 February 2016 at 00:54, Oleg Bartunov <obartu...@gmail.com> wrote:


    DDL support is what's missing for now.


TBH, based on experience with DDL replication and deparse in BDR, it's going to be missing for a while yet too, or at least not comprehensively present without caveats or exceptions.


Some DDL operations don't translate well to a series of replicatable actions. The case I hit the most is

ALTER TABLE mytable ADD COLUMN somecolumn sometype NOT NULL DEFAULT some_function();

This is executed (simplified) by taking an ACCESS EXCLUSIVE lock, changing the catalogs but not making the changes visible yet, rewriting the table, and committing to make the rewritten table and the catalog changes visible.

That won't work well with logical replication. We currently capture DDL with event triggers and log it to a table for later logical decoding and replay - that's the "recognised" way. The trouble is that replaying that statement results in an unnecessary full table rewrite on the downstream. Then we have to decode and send the stream of changes to a table called pg_temp_<oid_of_mytable>, truncate the copy of mytable on the downstream that we just rewrote, and apply those rows instead.
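For illustration, the capture side looks roughly like this (a simplified sketch, not BDR's actual code; here the raw statement text from current_query() is logged rather than the deparse output, and the table/function names are made up):

    CREATE TABLE ddl_queue (id serial PRIMARY KEY, cmd text);

    CREATE FUNCTION queue_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
        -- insert the statement so logical decoding of ddl_queue picks it up
        INSERT INTO ddl_queue(cmd) VALUES (current_query());
    END;
    $$;

    CREATE EVENT TRIGGER capture_ddl ON ddl_command_end
        EXECUTE PROCEDURE queue_ddl();

The downstream then executes the queued command when it reaches that point in the change stream, which is exactly how the redundant table rewrite sneaks in.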

Of course all that only works sensibly if you have exactly one upstream and the downstream copy of the table is treated as (or enforced as) read-only.

Improving this probably needs DDL deparse to be smarter. Rather than just emitting something that can be reconstructed into the SQL text of the DDL it needs to emit one or more steps that are semantically the same but allow us to skip the rewrite. Along the lines of:

* ALTER TABLE mytable ADD COLUMN somecolumn sometype;
* ALTER TABLE mytable ALTER COLUMN somecolumn SET DEFAULT some_function();
* <wait for rewrite data for mytable>
* ALTER TABLE mytable ALTER COLUMN somecolumn SET NOT NULL;

Alternatively, the downstream would need a hook that lets it intercept and prevent table rewrites caused by ALTER TABLE and similar, so that it can instead just truncate and wait for the new rows to come from the master.

Note that all this means the standby has to hold an ACCESS EXCLUSIVE lock on the table for the whole replay. That shouldn't be necessary; all we really need is an EXCLUSIVE lock, since concurrent SELECTs are fine. No idea how to do that.



Deparse is also just horribly complicated to get right. There are so many clauses and subclauses and variants of statements. Each of which must be perfect.



Not everything has a simple and obvious mapping on the downstream side either. TRUNCATE ... CASCADE is the obvious one. You do a cascaded truncate on the master - do you want that to replicate as a cascaded truncate on the replica, or as a truncate of only those tables that actually got truncated on the master? If the replica has additional tables with FKs pointing at the replicated tables, the TRUNCATE will truncate those too if you replicate it as CASCADE; if you don't, the truncate will fail instead. Really, both are probably wrong as far as the user is concerned, but we can't truncate just the tables truncated on the master, ignore the FK relationships, and leave dangling FK references either.
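A minimal illustration of the dilemma (hypothetical table names):

    -- on the master:
    CREATE TABLE parent(id int PRIMARY KEY);
    CREATE TABLE child(pid int REFERENCES parent(id));
    TRUNCATE parent CASCADE;    -- truncates parent and child

    -- the replica additionally has a local table:
    CREATE TABLE local_notes(pid int REFERENCES parent(id));
    -- replaying "TRUNCATE parent CASCADE" empties local_notes too;
    -- replaying "TRUNCATE parent, child" fails on local_notes's FK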


All this means that DDL replication is probably only going to make sense in scenarios where there's exactly one master and the replica obeys some rules like "don't create FKs pointing from non-replicated tables to tables replicated from somewhere else" - a constraint we currently have no way to express or enforce, the way we do for persistent-to-UNLOGGED FKs.



Then there are global objects. Something as simple as:

CREATE ROLE fred;

CREATE TABLE blah(...);
ALTER TABLE blah OWNER TO fred;

will break replication, because we only see the CREATE TABLE, not the CREATE ROLE. If we instead replayed the CREATE ROLE and there were multiple connections between different DBs on an upstream and downstream, apply would fail on all but one of them. But we can't do that anyway, since there's no way to capture that CREATE ROLE from any DB except the one it was executed in, which might not even be one of the ones doing replication.

I strongly suspect we'll need logical decoding to be made aware of such global DDL and decode it from the WAL writes to the system catalogs. Which will be fun - but at least modifications to the shared catalogs are a lot simpler than the sort of gymnastics done by ALTER TABLE, etc.



--
 Craig Ringer http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
