On Thu, Dec 10, 2020 at 3:19 PM Peter Smith <smithpb2...@gmail.com> wrote:
>
> On Tue, Dec 8, 2020 at 9:14 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Tue, Dec 8, 2020 at 11:53 AM Peter Smith <smithpb2...@gmail.com> wrote:
> > >
> > > Yes, I observed this same behavior.
> > >
> > > IIUC the only way for the tablesync worker to go from CATCHUP mode to SYNCDONE is via the call to process_sync_tables.
> > >
> > > But a side-effect of this is that when messages arrive during this CATCHUP phase, one tx will be getting handled by the tablesync worker before process_sync_tables() is ever encountered.
> > >
> > > I have created and attached a simple patch which allows the tablesync worker to detect if there is anything to do *before* it enters the apply main loop. Calling process_sync_tables() before the apply main loop offers a quick way out, so the message handling will not be split unnecessarily between the workers.
> > >
> >
> > Yeah, this demonstrates the idea can work, but as mentioned in my previous email [1] this needs much more work to make the COPY and the later fetching of changes from the publisher consistent. So, let me summarize the discussion so far. We wanted to enhance the tablesync phase of Logical Replication to enable decoding of prepared transactions [2]. The problem was that when we stream prepared transactions in the tablesync worker, it will simply commit them, due to the requirement of maintaining a single transaction for the entire duration of the copy and the streaming of transactions afterward. We can't simply disable the decoding of prepared xacts for tablesync workers because that can skip some of the prepared xacts forever on the subscriber, as explained in one of the emails above [3]. Now, while investigating the solutions to enhance tablesync to support decoding at prepare time, I found that due to the current design of tablesync we can see partial data of transactions on subscribers, which is also explained in the email above with an example [4]. This visibility problem has existed since Logical Replication was introduced in PostgreSQL, and the only answer I have gotten till now is that there doesn't seem to be any other alternative, which I think is not true, and I have provided one alternative as well.
> >
> > Next, we have discussed three different solutions, all of which will solve the first problem (allow the tablesync worker to decode transactions at prepare time) and one of which solves both the first and the second problem (partial transaction data visibility).
> >
> > Solution-1: Allow the tablesync worker to use multiple transactions. The reason for doing it in a single transaction today is that if, after the initial COPY, we commit and then crash while streaming changes of other transactions, the state of the table won't be known after the restart: as we are using a temporary slot, we don't know from where to restart syncing the table.
> >
> > IIUC, we primarily need to do two things to achieve multiple transactions: one is to have an additional state in the catalog (say "catchup") which will say that the initial copy is done. Then we need to have a permanent slot using which we can track the progress of the slot, so that after a restart (due to a crash, connection break, etc.) we can start from the appropriate position.
> > Now, this will allow us to do less work after recovering from a crash because we will know the restart point. As Craig mentioned, it also allows the sync slot to advance, freeing any held upstream resources before the whole sync is done, which is good if the upstream is busy and generating lots of WAL. Finally, committing as we go means we won't exceed the cid increment limit in a single txn.
> >
> > Solution-2: The next solution we discussed is to make the tablesync worker just gather messages after the COPY, in a way similar to how the current streaming of in-progress transactions feature gathers messages into a "changes" file, so that they can be replayed later by the apply worker. Now, here, as we don't need to replay the individual transactions in the tablesync worker in a single transaction, it will allow us to send transactions decoded at prepare time to the subscriber. This has some disadvantages, such as each transaction processed by the tablesync worker needing to be durably written to a file, and it can also lead to some apply lag later when the same is processed by the apply worker.
> >
> > Solution-3: Allow the tablesync workers to just perform the initial COPY and then, once the COPY is done for all relations, have the apply worker stream all the future changes. Now, surely, if large copies are required for multiple relations then the replay of transactions by the apply worker would be delayed a bit, but I don't know how much that matters as compared to the transaction visibility issue, and anyway we would have achieved the maximum parallelism by allowing the copy via multiple workers. This would reduce the load (both CPU and I/O) on the publisher side by decoding the data only once, instead of once for each tablesync worker and separately for the apply worker. I think it will use fewer resources to finish the work.
> >
> > Currently, in the tablesync worker, we create a slot with the CRS_USE_SNAPSHOT option, which creates a transaction snapshot on the publisher, and then we use the same snapshot for the COPY from the publisher. After this, when we try to receive the data from the publisher using the same slot, it will be in sync with the COPY. I think, to keep the same consistency between the COPY and the data we receive from the publisher in this approach, we need to export the snapshot while creating the slot in the apply worker by using CRS_EXPORT_SNAPSHOT and then use the same snapshot in all the tablesync workers doing the copy. In the tablesync workers, we can use the SET TRANSACTION SNAPSHOT command after "BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ" to use the exported snapshot. That way the COPY will use the same snapshot as is used for receiving the changes in the apply worker, and the data will be in sync.
> >
> > Then we also need a way to export a snapshot while the apply worker is already receiving changes, because users can use 'ALTER SUBSCRIPTION name REFRESH PUBLICATION', which allows new tables to be synced. I think we need to introduce a new command in exec_replication_command() to export the snapshot from the existing slot and then have it used by the new tablesync worker.
> >
> > Among the above three solutions, the first two will solve the first problem (allow the tablesync worker to decode transactions at prepare time) and the third solution will solve both the first and the second problem (partial transaction data visibility).
> > The third solution requires quite some redesign of how the Logical Replication work is synchronized between the apply and tablesync workers, and it might turn out to be a bigger implementation effort. I am tentatively thinking of going with the first or the second solution at this stage, and anyway, if later people feel that we need some bigger redesign, then we can go with something along the lines of Solution-3.
> >
> > Thoughts?
> >
> > [1] - https://www.postgresql.org/message-id/CAA4eK1%2BQC74wRQmbYT%2BMmOs%3DYbdUjuq0_A9CBbVoQMB1Ryi-OA%40mail.gmail.com
> > [2] - https://www.postgresql.org/message-id/cahut+puemk4so8ogzxc_ftzpkga8uc-y5qi-krqhsy_p0i3...@mail.gmail.com
> > [3] - https://www.postgresql.org/message-id/CAA4eK1KFsjf6x-S7b0dJLvEL3tcn9x-voBJiFoGsccyH5xgDzQ%40mail.gmail.com
> > [4] - https://www.postgresql.org/message-id/CAA4eK1Ld9XaLoTZCoKF_gET7kc1fDf8CPR3CM48MQb1N1jDLYg%40mail.gmail.com
> >
> > --
>
> Hi Amit,
>
> - Solution-3 has become too complicated to be attempted by me. Anyway, we may be better off just focusing on eliminating the new problems exposed by the 2PC work [1], rather than burning too much effort to fix some other quirk which apparently has existed for years.
> [1] https://www.postgresql.org/message-id/CAHut%2BPtm7E5Jj92tJWPtnnjbNjJN60_%3DaGGKYW3h23b7J%3DqeDg%40mail.gmail.com
>
> - Solution-2 has some potential lag problems, and maybe file resource problems as well. This idea did not get a very favourable response when I first proposed it.
>
> - This leaves Solution-1 as the best viable option to fix the currently known 2PC trouble.
>
> ~~
>
> So I will try to write a patch for the proposed Solution-1.
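
To make the Solution-1 flow above a bit more concrete, here is a rough sketch of the publisher-side sequence the tablesync worker drives today and where the proposal would change it. The slot and table names are just placeholders, and the new "initial copy done" catalog state is only the proposal discussed above, not an existing feature. As I understand it, these commands all run on the walsender connection the tablesync worker already uses:

    BEGIN READ ONLY ISOLATION LEVEL REPEATABLE READ;
    -- today this slot is created with TEMPORARY; Solution-1 would make it
    -- permanent so the worker can resume from the slot after a restart
    CREATE_REPLICATION_SLOT tablesync_slot_16390 LOGICAL pgoutput USE_SNAPSHOT;
    COPY public.my_table TO STDOUT;  -- runs under the snapshot set up by the slot
    COMMIT;

After the COPY, Solution-1 would also commit the local transaction on the subscriber, record the proposed "initial copy done" state for the table in pg_subscription_rel, and then apply the changes streamed from the (now permanent) slot in separate transactions. Splitting the work this way is what would let a prepared transaction stay prepared on the subscriber instead of being committed immediately.
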
Yeah, even I think that Solution-1 is best for solving the problem for 2PC.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
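
As a side note, the per-table sync state that Solution-1 proposes to extend lives in pg_subscription_rel on the subscriber, so the existing sync phases can be watched with a plain query like the sketch below (only the states that exist at the time of this discussion are shown; the proposed "initial copy done" state is hypothetical):

    SELECT s.subname, sr.srrelid::regclass AS relname,
           sr.srsubstate, sr.srsublsn
    FROM pg_subscription_rel sr
    JOIN pg_subscription s ON s.oid = sr.srsubid;
    -- srsubstate: 'i' = initialize, 'd' = data is being copied,
    -- 's' = synchronized, 'r' = ready (normal replication)

Solution-1 would add one more srsubstate value between the copy and the synchronized states, so a restarted tablesync worker would know the COPY has already been committed and only has to catch up from its permanent slot.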