On Mon, Jul 31, 2023 at 3:06 PM José Neves <rafanev...@msn.com> wrote:
>
> Hi there, hope to find you well.
>
> I'm attempting to develop a CDC on top of Postgres, currently using 12, the
> last minor, with a custom client, and I'm running into issues with data loss
> caused by out-of-order logical replication messages.
>
> The problem is as follows: postgres streams A, B, D, G, K, I, P logical
> replication events, upon exit signal we stop consuming new events at LSN K,
> and we wait 30s for out-of-order events. Let's say that we only got A, (and K
> ofc) so in the following 30s, we get B, D, however, for whatever reason, G
> never arrived. As with pgoutput-based logical replication we have no way to
> calculate the next LSN, we have no idea that G was missing, so we assumed
> that it all arrived, committing K to postgres slot and shutdown. In the next
> run, our worker will start receiving data from K forward, and G is lost
> forever...
> Meanwhile postgres moves forward with archiving and we can't go back to check
> if we lost anything. And even if we could, would be extremely inefficient.
>
> In sum, the issue comes from the fact that postgres will stream events with
> unordered LSNs on high transactional systems, and that pgoutput doesn't have
> access to enough information to calculate the next or last LSN, so we have no
> way to check if we receive all the data that we are supposed to receive,
> risking committing an offset that we shouldn't as we didn't receive yet
> preceding data.
>
As per my understanding, we stream the data in commit LSN order, and
within a particular transaction all the changes are sent in their LSN
order. Now, with concurrent transactions, it is possible that we send
some changes with an earlier LSN after sending the commit of another
transaction. Say we have changes as follows:

T-1
change1 LSN1-1000
change2 LSN2-2000
commit  LSN3-3000

T-2
change1 LSN1-500
change2 LSN2-1500
commit  LSN3-4000

In such a case, all the changes of T-1, including its commit, are sent
first, and then all the changes of T-2, including its commit, are sent.
So, one can say that some of the changes of T-2 with an earlier LSN
arrived after T-1's commit, but that shouldn't be a problem: if a
restart happens after we have received only part of T-2, we should
receive the entire T-2 again.

It is possible that you are seeing something else, but if so, please
try to share a more concrete example.

-- 
With Regards,
Amit Kapila.
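
To make the ordering above concrete, here is a minimal sketch in Python of
a commit-ordered decoder applied to the T-1/T-2 example. It is only a toy
model, not PostgreSQL code: the WAL list, the decode() function and the
confirmed_commit_lsn parameter are hypothetical names introduced for
illustration, and it assumes a restart re-reads the WAL from a point old
enough to cover every transaction that had not yet been confirmed.

```python
from collections import defaultdict

# Hypothetical WAL records taken from the example: (lsn, xid, kind).
WAL = [
    (500,  "T-2", "change1"),
    (1000, "T-1", "change1"),
    (1500, "T-2", "change2"),
    (2000, "T-1", "change2"),
    (3000, "T-1", "commit"),
    (4000, "T-2", "commit"),
]


def decode(wal, confirmed_commit_lsn=0):
    """Yield (xid, kind, lsn) in the order a commit-ordered decoder would.

    Changes are buffered per transaction and released only when that
    transaction's commit record is reached. Transactions whose commit LSN
    is at or below confirmed_commit_lsn are treated as already consumed
    and skipped (a simplification of what happens on restart).
    """
    pending = defaultdict(list)
    for lsn, xid, kind in sorted(wal):  # read the WAL in LSN order
        pending[xid].append((xid, kind, lsn))
        if kind == "commit":
            txn = pending.pop(xid)
            if lsn > confirmed_commit_lsn:
                yield from txn


stream = list(decode(WAL))
lsns = [lsn for _, _, lsn in stream]
commit_lsns = [lsn for _, kind, lsn in stream if kind == "commit"]

# Consumer confirmed T-1's commit (3000), then restarted in the middle of T-2.
resumed = list(decode(WAL, confirmed_commit_lsn=3000))

for record in stream:
    print(record)
```

In this model the LSNs of individual changes are not monotonic across the
stream (500 and 1500 come after 3000), but the commit LSNs are, and the
resumed stream contains T-2 from its first change. Under these assumptions,
a client that only confirms commit LSNs, and only after it has processed
the whole transaction, does not skip anything.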