We are using logical decoding in multimaster, and we have run into a problem:
inconsistent transactions are being sent to the replica.
Briefly, multimaster uses logical decoding as follows:
1. Each multimaster node is connected to every other node through a logical
decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario, each replication channel is used to replicate only
those transactions that originated at the source node.
We use the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node returns to the multimaster, we need to
recover this node to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we use only one
replication channel to receive all (self and foreign) transactions.
Only in this case can we guarantee a consistent order of applying transactions
at the recovered node.
After the end of recovery we need to recreate the replication slots with all
other cluster nodes (because we have already replayed the transactions from
these nodes).
To restart logical decoding we first drop the existing slot, then create a new
one, and then start logical replication from WAL position 0/0 (invalid LSN).
In this case recovery should start from the last consistent point.
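
To make the difference between the normal and the recovery channel concrete,
here is a minimal standalone sketch of the filtering decision. It is only a
model: RepOriginId is a stand-in typedef, and ChannelMode, LOCAL_ORIGIN and
skip_transaction() are hypothetical names, not multimaster code. In a real
output plugin a decision like this would presumably live in the
filter_by_origin_cb callback.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's RepOriginId; 0 plays the role of
 * InvalidRepOriginId, i.e. the change originated on this node. */
typedef uint16_t RepOriginId;
#define LOCAL_ORIGIN ((RepOriginId) 0)

/* Hypothetical per-channel mode, introduced only for this sketch. */
typedef enum { CHANNEL_NORMAL, CHANNEL_RECOVERY } ChannelMode;

/*
 * Returns true when a transaction with the given origin should be
 * skipped (filtered out) on this channel.
 */
static bool
skip_transaction(ChannelMode mode, RepOriginId origin_id)
{
    if (mode == CHANNEL_RECOVERY)
        return false;                   /* recovery: send self and foreign */
    return origin_id != LOCAL_ORIGIN;   /* normal: send only local changes */
}
```

In normal mode a transaction replayed from another node is skipped, while in
recovery mode the single channel forwards everything, which is what gives the
recovered node one consistent apply order.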

The problem is that, for some reason, the consistent point is not so
consistent, and we get partly decoded transactions.
For example, the transaction body consists of two UPDATEs, but reorder_buffer
extracts only the last one and sends this truncated transaction to the
destination, causing a consistency violation at the replica.  I started
investigating the logical decoding code and found several things that I do not
understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and 
T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica, and the replica confirms flush_lsn=300.
If we now want to restart logical decoding, we cannot start from a position
lower than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *              The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *              from the slot's confirmed_flush; otherwise, start from the specified
 *              location (but move it forwards to confirmed_flush if it's older than
 *              that, see below).
        else if (start_lsn < slot->data.confirmed_flush)
                 * It might seem like we should error out in this case, but it's
                 * pretty common for a client to acknowledge a LSN it doesn't have to
                 * do anything for, and thus didn't store persistently, because the
                 * xlog records didn't result in anything relevant for logical
                 * decoding. Clients have to be able to do that to support synchronous
                 * replication.
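
The rule quoted above can be restated as a small standalone function. This is
only a sketch of how I read that comment: XLogRecPtr and InvalidXLogRecPtr are
redefined locally as stand-ins, and effective_start_lsn() is a hypothetical
helper, not a function in PostgreSQL.

```c
#include <stdint.h>

/* Local stand-ins so the sketch compiles outside the PostgreSQL tree. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: the start position that is actually used, given
 * the requested start_lsn and the slot's confirmed_flush.
 */
static XLogRecPtr
effective_start_lsn(XLogRecPtr start_lsn, XLogRecPtr confirmed_flush)
{
    if (start_lsn == InvalidXLogRecPtr)
        return confirmed_flush;     /* 0/0: continue from confirmed_flush */
    if (start_lsn < confirmed_flush)
        return confirmed_flush;     /* older request is moved forward */
    return start_lsn;               /* newer request is used as given */
}
```

With confirmed_flush=300, both a request for 0/0 and a request for 100 end up
at 300, so neither can reach the start of T1.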

So does this mean that we have no chance to restore T1?
What is worse, if there are valid T1 transaction records with lsn >= 300, then
we can partly decode T1 and send this T1' to the replica.
Am I missing something here?
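
To spell out the scenario I am worried about, here is a toy model. It assumes
a reader that simply begins at a given LSN and collects the change records it
meets; it does not model the reorder buffer or anything else a slot keeps
track of. The record positions 150, 250 and 350 are made up for illustration,
and ChangeRecord and changes_seen() are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* local stand-in for the PostgreSQL type */

typedef struct
{
    uint32_t   xid;             /* transaction that wrote the record */
    XLogRecPtr lsn;             /* position of the record in the WAL */
} ChangeRecord;

/*
 * Count the change records of one transaction that are visible to a
 * reader which begins reading at read_from.
 */
static size_t
changes_seen(const ChangeRecord *recs, size_t n,
             uint32_t xid, XLogRecPtr read_from)
{
    size_t seen = 0;

    for (size_t i = 0; i < n; i++)
        if (recs[i].xid == xid && recs[i].lsn >= read_from)
            seen++;
    return seen;
}

/* Illustrative WAL: T1 (xid 1) has two UPDATEs, T2 (xid 2) has one. */
static const ChangeRecord example_wal[] = {
    {1, 150},                   /* T1: first UPDATE  */
    {2, 250},                   /* T2: UPDATE        */
    {1, 350},                   /* T1: second UPDATE */
};
```

Reading from 100 sees both UPDATEs of T1, while reading from 300 sees only the
second one, which is exactly the truncated T1' described above.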

Is there any alternative way to "seek" a slot to the proper position without
actually fetching data from it or recreating the slot?
Is there any mechanism in xlog that can enforce consistent decoding of a
transaction (so that no transaction records are missed)?
Maybe I have missed something, but I didn't find any "record_number" or
anything else that can identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
