Hi Craig,
Thanks for your explanation. I have to agree with your argument that, in
the general case, replication of DDL statements using logical decoding is
problematic. But we are mostly considering logical decoding in a quite
limited context: replication between two identical Postgres
database nodes (multimaster).
Do you think that in this case replication of DDL can be done as a
sequence of low-level operations on the system catalog tables,
including manipulation of locks? So in your example with the ALTER TABLE
statement, could we correctly replicate it to the other nodes
as a request to take an exclusive lock plus some manipulations of the
catalog tables and the data table itself?
If so, instead of full support for DDL in logical decoding we would only
need to:
1. Add an option controlling whether operations on system catalog tables
are included in logical replication.
2. Make it possible to replicate lock requests (which can be useful not
only for DDL).
I looked at how DDL is implemented in BDR and did it in a similar way in
our multimaster.
But it is awful: we need two different channels for propagating
changes. Additionally, in multimaster we want to enforce cluster-wide
ACID, which certainly includes operations on metadata. That will be very
difficult to implement if replication of DML and DDL is done in two
different ways...
Let me ask one more question concerning logical replication: how
difficult would it be, from your point of view, to support two-phase
commit in logical replication? Are there any fundamental problems?
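To be concrete, what I mean is decoding the standard PostgreSQL
two-phase commit commands (table and transaction names here are just
for illustration):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE id = 1;
    PREPARE TRANSACTION 'tx1';
    -- coordinator collects votes from all nodes, then:
    COMMIT PREPARED 'tx1';   -- or ROLLBACK PREPARED 'tx1'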
Thanks in advance,
Konstantin
On 17.02.2016 04:33, Craig Ringer wrote:
On 17 February 2016 at 00:54, Oleg Bartunov <obartu...@gmail.com> wrote:
DDL support is what is missing for now.
TBH, based on experience with DDL replication and deparse in BDR, it's
going to be missing for a while yet too, or at least not
comprehensively present without caveats or exceptions.
Some DDL operations don't translate well to a series of replicatable
actions. The case I hit the most is
ALTER TABLE mytable ADD COLUMN somecolumn sometype NOT NULL DEFAULT
some_function();
This is executed (simplified) by taking an ACCESS EXCLUSIVE lock,
changing the catalogs but not making the changes visible yet,
rewriting the table, and committing to make the rewritten table and
the catalog changes visible.
That won't work well with logical replication. We currently capture
DDL with event triggers and log them to a table for later logical
decoding and replay - that's the "recognised" way. The trouble being
that replaying that statement will result in an unnecessary full table
rewrite on the downstream. Then we have to decode and send the stream of
changes to a table called pg_temp_<oid_of_mytable>, truncate the copy
of mytable on the downstream that we just rewrote, and apply those rows
instead.
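For anyone following along, the capture side can be sketched with a
stock event trigger; the names are illustrative and this just logs the
raw statement text, whereas real deparse is much more involved:

    CREATE TABLE queued_ddl (id serial PRIMARY KEY, command text);

    CREATE FUNCTION queue_ddl() RETURNS event_trigger
    LANGUAGE plpgsql AS $$
    BEGIN
      -- record the statement for later decoding and replay
      INSERT INTO queued_ddl(command) VALUES (current_query());
    END $$;

    CREATE EVENT TRIGGER capture_ddl
      ON ddl_command_end EXECUTE PROCEDURE queue_ddl();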
Of course all that only works sensibly if you have exactly one
upstream and the downstream copy of the table is treated as (or
enforced as) read-only.
Improving this probably needs DDL deparse to be smarter. Rather than
just emitting something that can be reconstructed into the SQL text of
the DDL it needs to emit one or more steps that are semantically the
same but allow us to skip the rewrite. Along the lines of:
* ALTER TABLE mytable ADD COLUMN somecolumn sometype;
* ALTER TABLE mytable ALTER COLUMN somecolumn SET DEFAULT some_function();
* <wait for rewrite data for mytable>
* ALTER TABLE mytable ALTER COLUMN somecolumn SET NOT NULL;
Alternately the downstream would need a hook that lets it intercept
and prevent table rewrites caused by ALTER TABLE and similar. So it
can instead just do a truncate and wait for the new rows to come from
the master.
Note that all this means the standby has to hold an ACCESS EXCLUSIVE
lock on the table during all of replay. That shouldn't be necessary,
all we really need is an EXCLUSIVE lock since concurrent SELECTs are
fine. No idea how to do that.
Deparse is also just horribly complicated to get right. There are so
many clauses and subclauses and variants of statements. Each of which
must be perfect.
Not everything has a simple and obvious mapping on the downstream side
either. TRUNCATE ... CASCADE is the obvious one. You do a cascade
truncate on the master - do you want that to replicate as a cascaded
truncate on the replica, or a truncate of only those tables that
actually got truncated on the master? If the replica has additional
tables with FKs pointing at replicated tables, the TRUNCATE would
truncate those too if you replicate it as CASCADE; if you don't, the
truncate will fail instead. Really, both are probably wrong as far as
the user
is concerned, but we can't truncate just the tables truncated on the
master, ignore the FK relationships, and leave dangling FK references
either.
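A concrete case of the ambiguity, with a hypothetical replica-only
audit table:

    -- on the master: child has an FK to parent, so this truncates both
    TRUNCATE parent CASCADE;
    -- the replica also has local_audit with an FK to parent, so:
    --   replaying "TRUNCATE parent CASCADE" silently empties local_audit;
    --   replaying "TRUNCATE parent, child" fails on local_audit's FK.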
All this means that DDL replication is probably only going to make
sense in scenarios where there's exactly one master and the replica
obeys some rules like "don't create FKs pointing from non-replicated
tables to tables replicated from somewhere else". That is a constraint
we currently have no way to express or enforce, the way we do for
persistent-to-UNLOGGED FKs.
Then there's global objects. Something as simple as:
CREATE ROLE fred;
CREATE TABLE blah(...) OWNER fred;
will break replication because we only see the CREATE TABLE, not the
CREATE ROLE. If we instead replayed the CREATE ROLE and there were
multiple connections between different DBs on an upstream and
downstream, apply would fail on all but one. But we can't anyway, since
there's no way to capture that CREATE ROLE from any DB except the one
it was executed in, which might not even be one of the ones doing
replication.
I strongly suspect we'll need logical decoding to be made aware of
such global DDL and decode it from the WAL writes to the system
catalogs. Which will be fun - but at least modifications to the shared
catalogs are a lot simpler than the sort of gymnastics done by ALTER
TABLE, etc.
--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company