On Mon, Jul 31, 2023 at 3:06 PM José Neves <rafanev...@msn.com> wrote:
>
> Hi there, hope to find you well.
>
> I'm attempting to develop a CDC on top of Postgres, currently using 12, the 
> last minor, with a custom client, and I'm running into issues with data loss 
> caused by out-of-order logical replication messages.
>
> The problem is as follows: postgres streams A, B, D, G, K, I, P logical 
> replication events, upon exit signal we stop consuming new events at LSN K, 
> and we wait 30s for out-of-order events. Let's say that we only got A, (and K 
> ofc) so in the following 30s, we get B, D, however, for whatever reason, G 
> never arrived. As with pgoutput-based logical replication we have no way to 
> calculate the next LSN, we have no idea that G was missing, so we assumed 
> that it all arrived, committing K to postgres slot and shutdown. In the next 
> run, our worker will start receiving data from K forward, and G is lost 
> forever...
> Meanwhile postgres moves forward with archiving and we can't go back to check 
> if we lost anything. And even if we could, would be extremely inefficient.
>
> In sum, the issue comes from the fact that postgres will stream events with 
> unordered LSNs on high transactional systems, and that pgoutput doesn't have 
> access to enough information to calculate the next or last LSN, so we have no 
> way to check if we receive all the data that we are supposed to receive, 
> risking committing an offset that we shouldn't as we didn't receive yet 
> preceding data.
>

As per my understanding, we stream the data in the commit LSN order
and for a particular transaction, all the changes are per their LSN
order. Now, it is possible that for a parallel transaction, we send
some changes from a prior LSN after sending the commit of another
transaction. Say we have changes as follows:

T-1
change1 LSN1-1000
change2 LSN2- 2000
commit   LSN3- 3000

T-2
change1 LSN1-500
change2 LSN2-1500
commit   LSN3-4000

In such a case, all the changes including the commit of T-1 are sent
and then all the changes including the commit of T-2 are sent. So, one
can say that some of the changes from T-2 from prior LSN arrived after
T-1's commit but that shouldn't be a problem because if restart
happens after we received partial T-2, we should receive the entire
T-2.

It is possible that you are seeing something else but if so then
please try to share a more concrete example.

-- 
With Regards,
Amit Kapila.


Reply via email to