Hi Everyone,
I am working on initial schema sync for logical replication. Currently,
users have to manually create the schema on the subscriber side. The aim of
this feature is to add an option to CREATE SUBSCRIPTION so that schema sync
can be automatic. I am sharing the design doc below, but there are some
corner cases where the design does not work. Please share your opinion on
whether the design can be improved so that we can get rid of the corner
cases. This design is loosely based on pglogical.
DDL replication is required for this feature.
(https://www.postgresql.org/message-id/flat/CAAD30U%2BpVmfKwUKy8cbZOnUXyguJ-uBNejwD75Kyo%3DOjdQGJ9g%40mail.gmail.com)
SQL Changes:-
CREATE SUBSCRIPTION subscription_name
CONNECTION 'conninfo'
PUBLICATION publication_name [, ...]
[ WITH ( subscription_parameter [= value] [, ... ] ) ]
A new subscription_parameter, sync_initial_schema (enum), will be added.
It can have 3 values:-
TABLES, ALL, NONE (default)
With ALL, everything will be synced, including global objects.
Restriction :- sync_initial_schema=ALL can only be used with a publication
created with FOR ALL TABLES.
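For illustration, usage would look like this (sync_initial_schema is the
proposed parameter, not an existing one; subscription and publication names
are made up):

```sql
-- Hypothetical example of the proposed option: the schema of all
-- published tables would be created automatically on the subscriber
-- (plus global objects, if ALL were used instead of TABLES).
CREATE SUBSCRIPTION mysub
    CONNECTION 'host=publisher dbname=postgres'
    PUBLICATION mypub
    WITH (sync_initial_schema = 'TABLES');
```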
Design:-
Publisher :-
The publisher has to implement a `SHOW CREATE table_name` command; the
returned table definition will be used by the subscriber to create the exact
schema of the table on the subscriber side. One alternative is to do this on
the subscriber side itself: we could create a function similar to
describeOneTableDetails and call it from the subscriber. We also need to
maintain the same ownership as on the publisher.
The publisher should also have publication of DDL commands turned on.
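A rough sketch of what the proposed command could return (the command name
and output format are open questions; nothing like this exists in core
PostgreSQL today, and the table/owner names below are invented):

```sql
-- Hypothetical: the publisher returns complete, reproducible DDL for the
-- table, which the subscriber can execute verbatim, including ownership.
SHOW CREATE public.t1;
-- CREATE TABLE public.t1 (
--     a integer NOT NULL,
--     b text,
--     CONSTRAINT t1_pkey PRIMARY KEY (a)
-- );
-- ALTER TABLE public.t1 OWNER TO app_owner;
```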
Subscriber :-
1. In CreateSubscription(), when we create the replication slot
(walrcv_create_slot()), we should use CRS_EXPORT_SNAPSHOT, so that we can
use this snapshot later in pg_dump.
2. Now we can call pg_dump with the above snapshot from CreateSubscription.
This happens inside the opts.connect && opts.create_slot if statement. If we
fail at this step, we have to drop the replication slot and create a new
one, because we need a snapshot and creating a replication slot is the way
to get one. The reason for running pg_dump with the above snapshot is that
we don't want to execute the DDLs in the WAL logs twice: with the above
snapshot we get the state of the database as of the replication slot's
origin, and any changes after the snapshot will be in the WAL logs.
We will save the pg_dump output into a file (custom archive format), so the
pg_dump invocation will be similar to
pg_dump --connection_string --schema-only --snapshot=xyz -Fc --file initSchema
If sync_initial_schema=TABLES, we don't have to call pg_dump/restore at all;
the tablesync process will take care of it.
3. If we have to sync global objects, we need to call
pg_dumpall --globals-only as well. But pg_dumpall does not support the
--snapshot option, so if a user creates a new global object between the
creation of the replication slot and the run of pg_dumpall, that global
object will be created 2 times on the subscriber, which will error out the
apply process.
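The globals step would look roughly like this; there is no --snapshot
equivalent for pg_dumpall, which is exactly the race described above (the
output file name is just an example):

```shell
# Dump only roles/tablespaces. This runs at "now", not at the slot's
# snapshot, so a global object created after the slot was made can end up
# both in this dump AND in the replicated DDL stream, making the second
# CREATE fail on the subscriber.
pg_dumpall --globals-only --file initGlobals.sql
```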
4. walrcv_disconnect should be called only after pg_dump has finished;
otherwise the snapshot will no longer be valid.
5. Users with only the REPLICATION role cannot run pg_dump, so the
replication user has to be a superuser. This is a major problem.
postgres=# create role s4 WITH LOGIN Replication;
CREATE ROLE
$ pg_dump postgres -s -U s4
pg_dump: error: query failed: ERROR: permission denied for table t1
pg_dump: detail: Query was: LOCK TABLE public.t1, public.t2 IN ACCESS SHARE MODE
6. The pg_subscription_rel column srsubstate will get one more state,
SUBREL_STATE_CREATE 'c'. If sync_initial_schema is enabled, we will set the
table state to 'c'.
The above 6 steps will be done even if the subscription is not enabled, as
long as connect is true.
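For example, the per-table sync state could then be inspected as usual ('c'
is the proposed addition; the existing srsubstate values are i, d, f, s, r):

```sql
-- srsubstate currently knows i (init), d (data copy), f (finished copy),
-- s (synchronized), r (ready); the proposal adds c (create), meaning the
-- table still needs its schema created from the publisher's definition.
SELECT srrelid::regclass, srsubstate
FROM pg_subscription_rel;
```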
7. The leader apply process should check whether the initSchema file exists;
if it does, it should call pg_restore. We are not using the --pre-data and
--post-data sections as pglogical does, because post-data works on tables
that already have data, but we will fill the data into the tables at a later
stage. pg_restore can be called like this
pg_restore --connection_string -1 file_name
The -1 option executes every command inside one transaction, so if there is
any error everything will be rolled back.
pg_restore should be called quite early in the apply process code, before
any tablesync process is created.
Instead of checking whether the file exists, maybe the pg_subscription
catalog can be extended with a column SyncInitialSchema, and the apply
process would check SyncInitialSchema == SYNC_PENDING.
8. The tablesync process should check the state of the table; if it is
SUBREL_STATE_CREATE, it should get the latest definition from the publisher
and recreate the table (we have to recreate the table even if there are no
changes), and then go into copy-table mode as usual.
It might seem that tablesync is duplicating work already done by pg_restore.
We are doing it this way because of concurrent DDLs and the REFRESH
PUBLICATION command.
Concurrent DDL :-
A user can execute a DDL command on table t1 at the same time the subscriber
is trying to sync it. Pictorial representation: https://imgur.com/a/ivrIEv8 [1]
The tablesync process makes its own connection to the publisher, so it sees
a table state that can be ahead of what the apply process has applied, which
can introduce conflicts.
For example:-
CASE 1:- { The publisher removed column b from table t1 while the subscriber
was doing pg_restore (or at any point in the concurrent-DDL window described
in picture [1]). When the tablesync process starts its transaction on the
publisher, it will request data of table t1 including column b, which no
longer exists on the publisher. } That is why the tablesync process asks for
the latest definition.
Even if we delay the tablesync worker till all the DDL related to table t1
has been applied by the apply process, we can still have a window in which
the publisher issues a DDL command just before tablesync starts its
transaction, making the tablesync and publisher table definitions
incompatible (thanks to Masahiko for pointing out this race condition).
The apply process will skip all DDL/DML related to table t1, and tablesync
will apply those in the catchup phase.
Although there is one issue: what will happen to views or functions which
depend on the table? I think they should wait till table_state is
> SUBREL_STATE_CREATE (meaning we have the latest schema definition from the
publisher).
There might be corner cases to this approach, or maybe a better way to
handle concurrent DDL.
One simple solution might be to disallow DDLs on the publisher till all the
schema is synced and all tables have state >= SUBREL_STATE_DATASYNC (we can
have the CASE 1 issue even with DDL replication, so we have to wait till all
the tables have table_state >= SUBREL_STATE_DATASYNC). That might be a big
window for big databases.
Refresh publication :-
During refresh publication, the subscriber does create a new replication
slot, but we can't run pg_dump with a snapshot that starts from the origin
(maybe this is not an issue at all). In this case it makes more sense for
the tablesync worker to do the schema sync.
If the community is happy with the above design, I can start working on a
prototype.
Credits :- This design is inspired by pglogical. Also thanks to Zane,
Masahiko, and Amit for reviewing earlier designs.
Regards
Sachin Kumar
Amazon Web Services