I wrote:
> Thomas Munro <[email protected]> writes:
>> On Fri, Nov 18, 2022 at 11:08 AM Tom Lane <[email protected]> wrote:
>>> mamba has been showing intermittent failures in various replication
>>> tests since day one.
>> I wonder if it's a runtime variant of the other problem. We do
>> load_file("libpqwalreceiver", false) before unblocking signals but
>> maybe don't resolve the symbols until calling them, or something like
>> that...
> Yeah, that or some other NetBSD bug could be the explanation, too.
> Without a stack trace it's hard to have any confidence about it,
> but I've been unable to reproduce the problem outside the buildfarm.
Thanks to commit 51b5834cd I've now been able to capture some info
from mamba's last couple of failures [1][2]. Sure enough, what is
happening is that postmaster children are getting stuck in recursive
rtld symbol resolution. A couple of the stack traces I collected are
#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee59e4 in dlopen () from /usr/libexec/ld.elf_so
#3 0x01e54ed0 in internal_load_library (
libname=libname@entry=0xfd74cc88
"/home/buildfarm/bf-data/HEAD/pgsql.build/tmp_install/home/buildfarm/bf-data/HEAD/inst/lib/postgresql/libpqwalreceiver.so")
at dfmgr.c:239
#4 0x01e55c78 in load_file (filename=<optimized out>, restricted=<optimized
out>) at dfmgr.c:156
#5 0x01c5ba24 in WalReceiverMain () at walreceiver.c:292
#6 0x01c090f8 in AuxiliaryProcessMain
(auxtype=auxtype@entry=WalReceiverProcess) at auxprocess.c:161
#7 0x01c10970 in StartChildProcess (type=WalReceiverProcess) at
postmaster.c:5310
#8 0x01c123ac in MaybeStartWalReceiver () at postmaster.c:5475
#9 MaybeStartWalReceiver () at postmaster.c:5468
#10 sigusr1_handler (postgres_signal_arg=<optimized out>) at postmaster.c:5131
#11 <signal handler called>
#12 0xfdee6b44 in _rtld_symlook_obj () from /usr/libexec/ld.elf_so
#13 0xfdee6fc0 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#14 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#15 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#16 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#17 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#18 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC
#0 0xfdeede4c in ___lwp_park60 () from /usr/libexec/ld.elf_so
#1 0xfdee3e08 in _rtld_exclusive_enter () from /usr/libexec/ld.elf_so
#2 0xfdee4ba4 in _rtld_exit () from /usr/libexec/ld.elf_so
#3 0xfd54ea74 in __cxa_finalize () from /usr/lib/libc.so.12
#4 0xfd54e354 in exit () from /usr/lib/libc.so.12
#5 0x01c963c0 in proc_exit (code=code@entry=0) at ipc.c:152
#6 0x01c056e4 in AutoVacLauncherShutdown () at autovacuum.c:853
#7 0x01c071dc in AutoVacLauncherMain (argv=0x0, argc=0) at autovacuum.c:800
#8 0x01c07694 in StartAutoVacLauncher () at autovacuum.c:416
#9 0x01c11d3c in reaper (postgres_signal_arg=<optimized out>) at
postmaster.c:3038
#10 <signal handler called>
#11 0xfdee6f64 in _rtld_symlook_list () from /usr/libexec/ld.elf_so
#12 0xfdee7644 in _rtld_symlook_default () from /usr/libexec/ld.elf_so
#13 0xfdee795c in _rtld_find_symdef () from /usr/libexec/ld.elf_so
#14 0xfdee7ad0 in _rtld_find_plt_symdef () from /usr/libexec/ld.elf_so
#15 0xfdee1918 in _rtld_bind () from /usr/libexec/ld.elf_so
#16 0xfdee1dc0 in _rtld_bind_secureplt_start () from /usr/libexec/ld.elf_so
Backtrace stopped: frame did not save the PC
which is pretty much just the same thing we were seeing before
commit 8acd8f869 :-(
Now, we certainly cannot think that these are occurring early in
postmaster startup. In the wake of 8acd8f869, we should expect
that there's no further need to call rtld_bind at all in the
postmaster, but seemingly that's not so. It's very frustrating
that the backtrace stops where it does :-(. It's also strange
that we're apparently running with signals enabled whereever
it is that rtld_bind is getting called from. Could it be that
sigaction is failing to install the requested signal mask, so
that one postmaster signal handler is interrupting another?
regards, tom lane
[1]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-24%2021%3A45%3A29
[2]
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2022-11-29%2020%3A50%3A36