Re: Column Filtering in Logical Replication

Masahiko Sawada Wed, 13 Apr 2022 01:11:25 -0700

On Sun, Mar 20, 2022 at 3:23 PM Amit Kapila <[email protected]> wrote:
>
> On Sun, Mar 20, 2022 at 8:41 AM Amit Kapila <[email protected]> wrote:
> >
> > On Fri, Mar 18, 2022 at 10:42 PM Tomas Vondra
> > <[email protected]> wrote:
> >
> > > So the question is why those two sync workers never complete - I guess
> > > there's some sort of lock wait (deadlock?) or infinite loop.
> > >
> >
> > It would be a bit tricky to reproduce this even if the above theory is
> > correct but I'll try it today or tomorrow.
> >
>
> I am able to reproduce it with the help of a debugger. First, I
> added a LOG message and some while (true) loops to debug the sync and
> apply workers. Test setup:
>
> Node-1:
> create table t1(c1 int);
> create table t2(c1 int);
> insert into t1 values(1);
> create publication pub1 for table t1;
> create publication pub2;
>
> Node-2:
> set max_sync_workers_per_subscription to 1 in postgresql.conf
> create table t1(c1 int);
> create table t2(c1 int);
> create subscription sub1 connection 'dbname = postgres' publication pub1;
>
> Up to this point, just let the debuggers in both workers continue.
>
> Node-1:
> alter publication pub1 add table t2;
> insert into t1 values(2);
>
> Here, we have to debug the apply worker so that, when it tries to
> apply the insert, the debugger stops in apply_handle_insert()
> right after begin_replication_step().
>
> Node-2:
> alter subscription sub1 set publication pub1, pub2;
>
> Now, continue the debugger of the apply worker. It should first start
> the sync worker and then exit because of the parameter change. All of
> these debugging steps are only there to ensure that ordering: the
> apply worker starts the sync worker first and then exits. After this
> point, the table sync worker never finishes and the log is filled with
> messages: "reached max_sync_workers_per_subscription limit" (a message
> newly added by me in the attached debug patch).
>
> Now, it is not completely clear to me how exactly '013_partition.pl'
> leads to this situation but there is a possibility based on the LOGs


I've looked at this issue and came to the same analysis. I could also
reproduce it with the steps shared by Amit.

As I mentioned in another thread[1], the fact that the tablesync
worker doesn't check the return value of
wait_for_worker_state_change() seems like a bug to me. My initial
idea for a fix is to have the tablesync worker check the return value
and exit if it's false. That way, the apply worker can restart and
request the launch of the tablesync worker again. What do you think?
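
To make the idea concrete, here is a minimal standalone sketch of the
control flow I have in mind. It is not the actual tablesync.c code:
SyncSim, sim_wait_for_state(), tablesync_after_wait() and the SYNC_*
values are hypothetical stand-ins, and the stand-in does not really
wait, it only models the "apply worker is gone, so return false" case.

```c
#include <stdbool.h>

#define STATE_CATCHUP 'c'       /* hypothetical stand-in for the catchup state */

/* Hypothetical view of what the tablesync worker can observe. */
typedef struct SyncSim
{
    bool apply_worker_alive;    /* is the apply worker still running? */
    char relstate;              /* current sync state of the table */
} SyncSim;

typedef enum SyncAction
{
    SYNC_PROCEED,               /* continue with catchup */
    SYNC_EXIT                   /* give up and free the sync worker slot */
} SyncAction;

/*
 * Stand-in for wait_for_worker_state_change(): returns false when the
 * apply worker has gone away, true when the expected state is reached.
 * (No real waiting happens in this sketch.)
 */
static bool
sim_wait_for_state(const SyncSim *sim, char expected)
{
    if (!sim->apply_worker_alive)
        return false;
    return sim->relstate == expected;
}

/*
 * Proposed behavior: check the return value instead of ignoring it,
 * and exit when it is false so that a restarted apply worker can
 * request a new tablesync worker.
 */
static SyncAction
tablesync_after_wait(const SyncSim *sim)
{
    if (!sim_wait_for_state(sim, STATE_CATCHUP))
        return SYNC_EXIT;
    return SYNC_PROCEED;
}
```

In this sketch, a sync worker whose apply worker has already exited
gets SYNC_EXIT rather than carrying on while holding the only slot
allowed by max_sync_workers_per_subscription.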

Regards,

-- 
Masahiko Sawada
EDB:  https://www.enterprisedb.com/
