Thank you for the answers.

> No, you don't need to recreate them. Just advance your replication identifier 
> downstream and request a replay position in the future. Let the existing slot 
> skip over unwanted data and resume where you want to start replay.
> You can advance the replication origins on the peers as you replay forwarded 
> xacts from your master.
> Have a look at how the BDR code does this during "catchup mode" replay.
> So while your problem discussed below seems concerning, you don't have to 
> drop and recreate slots like you are currently doing. 

The only reason for recreating the slot is that I want to move it to the 
current "horizon" and skip all pending transactions without explicitly 
specifying the restart position.
If I do not drop the slot and just restart replication specifying position 0/0 
(invalid LSN), then replication will continue from the current slot position 
in WAL, won't it?
So there is no way to specify something like "start replication from the end 
of WAL", similar to lseek(fd, 0, SEEK_END).
Right now I am trying to overcome this limitation by explicitly calculating 
the position from which we should continue replication.
But unfortunately the problem with partly decoded transactions persists.
Next week I will try to create an example reproducing the problem without any 
multimaster stuff, just using a standard logical decoding plugin.
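
For illustration, the explicit calculation could look roughly like the sketch 
below. It is only a sketch under assumptions: the slot name "mtm_slot_1" and 
the helper names are hypothetical, and the two LSN strings are assumed to have 
been fetched beforehand (the slot's confirmed position and the current end of 
WAL on the sender). The only facts relied upon are the "hi/lo" hexadecimal LSN 
notation and the walsender command 
START_REPLICATION SLOT name LOGICAL XXX/XXX.

```python
def parse_lsn(text):
    """Convert an LSN in 'hi/lo' hex notation to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def format_lsn(value):
    """Convert a 64-bit integer back to 'hi/lo' hex notation."""
    return "%X/%X" % (value >> 32, value & 0xFFFFFFFF)


def start_replication_command(slot_name, confirmed_flush, wal_end):
    """Build a START_REPLICATION command that skips ahead to wal_end.

    Hypothetical helper: the start position is the later of the two
    LSNs, so it never moves behind what the slot has already confirmed.
    """
    start = max(parse_lsn(confirmed_flush), parse_lsn(wal_end))
    return "START_REPLICATION SLOT %s LOGICAL %s" % (
        slot_name, format_lsn(start))


# "mtm_slot_1" and both LSN values are made-up illustration data.
print(start_replication_command("mtm_slot_1", "0/16B3748", "0/2000000"))
```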

> > To restart logical decoding we first drop existed slot, then create new one 
> > and then start logical replication from the WAL position 0/0 (invalid LSN).
> > In this case recovery should be started from the last consistent point.
>
> How do you create the new slot? SQL interface? walsender interface? Direct C 
> calls?

The slot is created by the peer node using a standard libpq connection with 
a database=replication connection string.

> > The problem is that for some reasons consistent point is not so consistent 
> > and we get partly decoded transactions.
> > I.e. transaction body consists of two UPDATE but reorder_buffer extracts only 
> > the one (last) update and sent this truncated transaction to destination 
> > causing consistency violation at replica.  I started investigation of logical 
> > decoding code and found several things which I do not understand.
>
> Yeah, that sounds concerning and shouldn't happen.

I looked at the replication code more carefully and understood that my first 
concerns were wrong.
Confirming the flush position should not prevent replaying transactions with 
smaller LSNs.
But unfortunately the problem really is present. Maybe it is caused by race 
conditions (although most logical decoder data is local to the backend).
This is why I will try to create a reproducing scenario without multimaster.

> > Assume that we have transactions T1={start_lsn=100, end_lsn=400} and 
> > T2={start_lsn=200, end_lsn=300}.
> > Transaction T2 is sent to the replica and replica confirms that flush_lsn=300.
> > If now we want to restart logical decoding, we can not start with position 
> > less than 300, because CreateDecodingContext doesn't allow it:
>
> Right. You've already confirmed receipt of T2, so you can't receive it again.
>
> > So it means that we have no chance to restore T1?
>
> Wrong. You can, because the slot's restart_lsn will still be some LSN <= 
> 100. The slot keeps track of inprogress transactions (using xl_running_xacts 
> records) and knows it can't discard WAL past lsn 100 because xact T1 is still 
> in-progress, so it must be able to decode from the start of it.
> When you create a decoding context decoding starts at restart_lsn not at 
> confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start 
> resulting in decoded data being sent to you. So in your case, T1 commits at 
> lsn=400, which is >300, so you'll receive the whole xact for T1.
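
To make the rule above concrete, here is a toy model of it in Python. It is 
not PostgreSQL code: the names Xact, restart_bound and delivered are 
hypothetical, LSNs are plain integers, and the comparison against 
confirmed_flush_lsn is a simplification of what the server really does. It 
only shows that reading must begin no later than the start of the oldest 
unconfirmed transaction, while only transactions committing after 
confirmed_flush_lsn are sent.

```python
from collections import namedtuple

# Toy transaction: LSNs are plain integers as in the T1/T2 example.
Xact = namedtuple("Xact", ["name", "start_lsn", "commit_lsn"])


def restart_bound(xacts, confirmed_flush_lsn):
    """Upper bound for restart_lsn in this toy model.

    Decoding has to begin at or before the start of every transaction
    whose commit has not been confirmed yet.
    """
    pending = [x.start_lsn for x in xacts
               if x.commit_lsn > confirmed_flush_lsn]
    return min(pending, default=confirmed_flush_lsn)


def delivered(xacts, confirmed_flush_lsn):
    """Names of transactions sent to the consumer, in commit order."""
    return [x.name
            for x in sorted(xacts, key=lambda x: x.commit_lsn)
            if x.commit_lsn > confirmed_flush_lsn]


t1 = Xact("T1", start_lsn=100, commit_lsn=400)
t2 = Xact("T2", start_lsn=200, commit_lsn=300)

print(restart_bound([t1, t2], confirmed_flush_lsn=300))
print(delivered([t1, t2], confirmed_flush_lsn=300))
```

In this model, with flush_lsn=300 confirmed, reading still has to begin at 
LSN 100 or earlier, T2 is skipped, and T1 is delivered as a whole.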

Yes, but unfortunately it does happen. I need to understand why...

> It's all already there. See logical decoding's use of xl_running_xacts.

But how is this information persisted?
What will happen if the walsender is restarted?

> -- 
>  Craig Ringer         
>  PostgreSQL Development, 24x7 Support, Training & Services
