On Sun, Mar 20, 2022 at 3:23 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
>
> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <amit.kapil...@gmail.com> wrote:
> >
> > On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
> > <tomas.von...@enterprisedb.com> wrote:
> > >
> > > So the question is why those two sync workers never complete - I guess
> > > there's some sort of lock wait (deadlock?) or infinite loop.
> > >
> >
> > It would be a bit tricky to reproduce this even if the above theory is
> > correct but I'll try it today or tomorrow.
> >
>
> I am able to reproduce it with the help of a debugger. Firstly, I have
> added a LOG message and some while (true) loops to debug the sync and
> apply workers. Test setup:
>
> Node-1:
> create table t1(c1 int);
> create table t2(c1 int);
> insert into t1 values(1);
> create publication pub1 for table t1;
> create publication pub2;
>
> Node-2:
> change max_sync_workers_per_subscription to 1 in postgresql.conf
> create table t1(c1 int);
> create table t2(c1 int);
> create subscription sub1 connection 'dbname = postgres' publication pub1;
>
> Up to this point, just let the debuggers in both workers continue.
>
> Node-1:
> alter publication pub1 add table t2;
> insert into t1 values(2);
>
> Here, we have to debug the apply worker such that when it tries to
> apply the insert, the debugger stops in function apply_handle_insert()
> after doing begin_replication_step().
>
> Node-2:
> alter subscription sub1 set publication pub1, pub2;
>
> Now, continue the debugger of the apply worker; it should first start the
> sync worker and then exit because of the parameter change. All of these
> debugging steps are just to ensure that it first starts the sync worker
> and then exits. After this point, the table sync worker never finishes
> and the log is filled with messages: "reached
> max_sync_workers_per_subscription limit" (a newly added message by me
> in the attached debug patch).
>
> Now, it is not completely clear to me how exactly '013_partition.pl'
> leads to this situation but there is a possibility based on the LOGs

I've looked at this issue and came to the same analysis. I could also
reproduce the issue with the steps shared by Amit.

As I mentioned in another thread[1], the fact that the tablesync worker
doesn't check the return value from wait_for_worker_state_change() seems
like a bug to me. So my initial thought for a fix is to have the
tablesync worker check the return value and exit if it's false. That
way, the apply worker can restart and request to launch the tablesync
worker again. What do you think?

Regards,

--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
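
To make the proposed change concrete, here is a toy model in Python. It is
only a sketch and not PostgreSQL code: the ApplyWorker class, the
tablesync_step() helper and the "exit"/"continue" results are hypothetical
names used for illustration, and wait_for_worker_state_change() here is a
stand-in that I assume returns false once the apply worker is gone.

```python
from dataclasses import dataclass


@dataclass
class ApplyWorker:
    """Hypothetical stand-in for the apply worker's liveness."""
    alive: bool = True


def wait_for_worker_state_change(apply_worker: ApplyWorker) -> bool:
    """Toy stand-in for the real function.

    Assumption: it reports False when the apply worker has exited,
    so the state the tablesync worker waits for can never be reached.
    """
    return apply_worker.alive


def tablesync_step(apply_worker: ApplyWorker, check_return_value: bool) -> str:
    """Return what the tablesync worker does after waiting."""
    ok = wait_for_worker_state_change(apply_worker)
    if check_return_value and not ok:
        # Proposed behaviour: give up, so the restarted apply worker
        # can request a fresh tablesync worker.
        return "exit"
    # Behaviour when the result is ignored: carry on regardless.
    return "continue"
```

With max_sync_workers_per_subscription set to 1, a tablesync worker that
carries on after the apply worker has exited keeps occupying the only slot,
which would match the repeated "reached max_sync_workers_per_subscription
limit" messages in Amit's debug log.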