> -----Original Message----- > From: Zhijie Hou (Fujitsu) <houzj.f...@fujitsu.com> > Sent: Wednesday, August 23, 2023 10:27 AM > To: Thomas Munro <thomas.mu...@gmail.com> > Cc: Amit Kapila <amit.kapil...@gmail.com>; pgsql-hackers > <pgsql-hack...@postgresql.org> > Subject: RE: subscription/015_stream sometimes breaks > > On Wednesday, August 23, 2023 4:55 AM Thomas Munro > <thomas.mu...@gmail.com> wrote: > > > > On Wed, Aug 23, 2023 at 8:21 AM Thomas Munro > <thomas.mu...@gmail.com> > > wrote: > > > I didn't study it closely but it looks like there might be a second > > > deadlock, after the one that is expected by the test? Examples from > > > the past couple of weeks: > > > > I should add, it's not correlated with the patches that cfbot is > > testing, and it's the most frequent failure for which that is the case. > > > > suite | name | distinct_patches | errors > > --------------+------------+------------------+-------- > > subscription | 015_stream | 47 | 61 > > > > Thanks for reporting ! > I am researching the failure and will share my analysis.
Hi, After an off-list discussion with Amit, we figured out the reason. From the crash log, I can see the apply worker crashed when accessing the worker->proc, so I think it's because the work->proc has been released. 577: /* Now terminate the worker ... */ > 578: kill(worker->proc->pid, signo); 579: 580: /* ... and wait for it to die. */ Normally, this should not happen because we take a lock on LogicalRepWorkerLock when shutting all the parallel workers[1] which can prevent concurrent worker to free the worker info. But in logicalrep_worker_stop_internal(), when stopping parallel worker #1, we will release the lock shortly. and at this timing it's possible that another parallel worker #2 which reported an ERROR will shutdown by itself and free the worker->proc. So when we try to stop that parallel worker #2 in next round, we didn't realize it has been closed, thus accessing invalid memory(worker->proc). [1]-- LWLockAcquire(LogicalRepWorkerLock, LW_SHARED); workers = logicalrep_workers_find(MyLogicalRepWorker->subid, true); foreach(lc, workers) { LogicalRepWorker *w = (LogicalRepWorker *) lfirst(lc); ** if (isParallelApplyWorker(w)) logicalrep_worker_stop_internal(w, SIGTERM); } -- The bug happens after commit 2a8b40e where isParallelApplyWorker() start to use the worker->type to check but we forgot to reset the worker type at worker exit time. So, even if the worker #2 has shutdown, the worker_type is still valid and we try to stop it again. Previously, the isParallelApplyWorker() used the worker->leader_pid which will be reset when the worker exits, so the "if (isParallelApplyWorker(w))" won't pass in this case and we don't try to stop the worker #2. To fix it I think we need to reset the worker type at exit as well. Attach the patch which does the same. I am also testing it locally to see if there are other issues here. Best Regards, Hou Zhijie
0001-Reset-worker-type-at-exit-time.patch
Description: 0001-Reset-worker-type-at-exit-time.patch