Hi,

(adding Samuel to CC)
On Sun, Oct 12, 2025 at 01:42:30PM +1300, Thomas Munro wrote:
> On Sun, Oct 12, 2025 at 1:00 AM Alexander Lakhin <[email protected]> wrote:
> > !!!wrapper_handler[1988]| postgres_signal_arg: 30, PG_NSIG: 33
> > !!!wrapper_handler[1989]| postgres_signal_arg: 30, PG_NSIG: 33
> > !!!wrapper_handler[3284]| postgres_signal_arg: 14, PG_NSIG: 33
> > !!!wrapper_handler[3284]| postgres_signal_arg: 28476608, PG_NSIG: 33
> > TRAP: failed Assert("postgres_signal_arg < PG_NSIG"), File: "pqsignal.c",
> > Line: 94, PID: 3284

Those issues are really quite rare, at least on fruitcrow (but Alexander
has better introspection into this, I think), but the test_shm_mq hangs
are frequent enough (most/all of the timeouts on
https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=fruitcrow&br=master
and other branches) that I was trying to get a handle on them.

What was really irritating was that gdb didn't show anything much for
backends (contrary to the postmaster process), like:

|#0 0x00000001023cdaec in ?? () from /lib/ld-x86-64.so.1
|#1 0x00000001023ce1ae in ?? () from /lib/ld-x86-64.so.1
|#2 0x00000001023f6db8 in ?? ()
|#3 0x0000000000000000 in ?? ()

so debugging this was difficult. I managed to get Samuel Thibault to
look at it at FOSDEM last weekend, and he immediately figured out that
not enough information was copied on fork() and committed a fix to
glibc[1]. With this, the backtrace for the hanging test_shm_mq
background worker looks like this:

(from the Mach debugger:)

|switch_context(...)+0xbb
|thread_invoke(...)+0xe3
|thread_block(...)+0x4b
|gsync_wait(...)+0x1d8
|_Xgsync_wait(...)+0x73
|ipc_kobject_server(...)+0xa5
|mach_msg_trap(...)+0x8b0
|syscall64(...)+0xdd
|>>>>> user space <<<<<

(from gdb:)

|(gdb) bt
|#0 0x00000001023cdc6c in __GI___mach_msg_trap ()
| at ./build-tree/hurd-amd64-libc/mach/mach_msg_trap.S:2
|#1 0x00000001023ce32e in __GI___mach_msg (msg=msg@entry=0x101cce1c0, option=option@entry=3,
| send_size=send_size@entry=112, rcv_size=rcv_size@entry=48, rcv_name=18, timeout=timeout@entry=0,
| notify=0) at ./mach/msg.c:111
|#2 0x0000000102643579 in __gsync_wait (task=<optimized out>, addr=addr@entry=35184372089520,
| val1=val1@entry=2, val2=val2@entry=0, msec=msec@entry=0, flags=flags@entry=0)
| at ./build-tree/hurd-amd64-libc/mach/RPC_gsync_wait.c:187
|#3 0x0000000101d04bee in __GI___pthread_mutex_lock (mtxp=mtxp@entry=0x2000000002b0)
| at ../sysdeps/mach/hurd/htl/pt-mutex-lock.c:37
|#4 0x0000000101d08944 in __pthread_enable_asynccancel () at ./htl/cancellation.c:28
|#5 0x000000010250b065 in __GI___libc_write (fd=4, buf=0x101cce2fb, nbytes=1)
| at ../sysdeps/mach/hurd/write.c:25
|#6 0x000000010063b6c4 in sendSelfPipeByte () at waiteventset.c:1925
|#7 0x000000010063b69c in latch_sigurg_handler (postgres_signal_arg=16) at waiteventset.c:1914
|#8 0x00000001008b9c6c in wrapper_handler (postgres_signal_arg=16) at pqsignal.c:110
|#9 <signal handler called>
|#10 0x0000000101d04795 in __pthread_testcancel () at ./htl/pt-testcancel.c:31
|#11 0x0000000101d0895c in __pthread_enable_asynccancel () at ./htl/cancellation.c:33
|#12 0x0000000102505a65 in __GI___libc_read (fd=3, buf=0x101ccec00, nbytes=1)
|#13 0x000000010063b729 in drain () at waiteventset.c:1974
|#14 0x000000010063b4b4 in WaitEventSetWaitBlock (set=0x200000015418, cur_timeout=-1,
| occurred_events=0x101ccf0f0, nevents=1) at waiteventset.c:1534
|#15 0x000000010063b268 in WaitEventSetWait (set=0x200000015418, timeout=-1,
| occurred_events=0x101ccf0f0, nevents=1, wait_event_info=134217764) at waiteventset.c:1140
|#16 0x000000010062c9e6 in WaitLatch (latch=0x10df221a4, wakeEvents=33, timeout=0,
|#17 0x0000000100635b32 in shm_mq_send_bytes (mqh=0x200000051ab8, nbytes=130, data=0x200000049fd8,
|#18 0x0000000100635021 in shm_mq_sendv (mqh=0x200000051ab8, iov=0x101ccf250, iovcnt=1, nowait=false,
|#19 0x0000000100634c95 in shm_mq_send (mqh=0x200000051ab8, nbytes=130, data=0x200000049fd8,
|#20 0x000000010e8aa72d in copy_messages (inqh=0x200000051b40, outqh=0x200000051ab8) at worker.c:200
|#21 test_shm_mq_main (main_arg=<optimized out>) at worker.c:141
[...]

So there's a signal handler and pthread cancellation involved. Having a
backtrace enabled Samuel to push a fix to glibc[2] that adds a critical
section to the cancel lock.

Before, Postgres would hang within a few tens of thousands of iterations
(or at most a few hundred thousand) when running something like this
(50000 iterations overall in this case), adapted from
src/test/modules/test_shm_mq/sql/test_shm_mq.sql:

|postgres=# CREATE EXTENSION test_shm_mq; -- once
|postgres=# SELECT test_shm_mq(100, (select string_agg(chr(32+(random()*95)::int), '')
|             from generate_series(1,(100+200*random())::int)), 50000, 1);

With the glibc patch from [2] applied, I have just now been running the
above SQL for 5 million iterations without a hang. So those
semi-frequent hangs on fruitcrow should be fixed now, yay!

> Hmm. We only install the handler for real signal numbers, and it
> clearly managed to find the handler, so then how did it corrupt signo
> before calling the function? I wonder if there could be concurrency bugs
> reached by our perhaps unusually large amount of signaling (we have
> found bugs in the signal implementations of several other OSes...).
> This might be the code:
>
> https://github.com/bminor/glibc/blob/master/hurd/hurdsig.c#L639
>
> It appears to suspend the thread selected to handle the signal, mess
> with its stack/context and then resume it, just like traditional
> monokernels, it's just done in user space by code running in a helper
> thread that communicates over Mach ports. So it looks like I
> misunderstood that comment in the docs, it's not the handler itself
> that runs in a different thread, unless I'm looking at the wrong code
> (?).
>
> Some random thoughts after skim-reading that and
> glibc/sysdeps/mach/hurd/x86/trampoline.c:
>
> * I wonder if setting up sigaltstack() and then using SA_ONSTACK in
> pqsignal() would behave differently, though SysV AMD64 calling
> conventions (used by Hurd IIUC) have the first argument in %rdi, not
> the stack, so I don't really expect that to be relevant...

Haven't tried this so far.

> * I wonder about the special code paths for handlers that were already
> running and happened to be in sigreturn(), or something like that,
> which I didn't study at all, but it occurred to me that our pqsignal
> will only block the signal itself while running a handler (since it
> doesn't specify SA_NODEFER)... so what happens if you block all
> signals while running each handler by changing
> sigemptyset(&act.sa_mask) to sigfillset(&act.sa_mask)?

I tried that before, and it seemed to help a bit, but the hangs are
pretty random and certainly still happen from time to time with just
this change.
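For reference, the two variations discussed above (an alternate signal
stack with SA_ONSTACK, and blocking all signals via sa_mask) would look
roughly like this in a pqsignal()-style wrapper. This is only an
untested sketch of the idea, not the actual src/port/pqsignal.c code,
and the names and flag choices are illustrative:

/* Untested sketch, loosely modeled on src/port/pqsignal.c -- not the real code. */
#include <signal.h>
#include <stdlib.h>
#include <string.h>

typedef void (*sig_handler_t) (int);

/* Variation 1: give signal handlers their own stack, so SA_ONSTACK can be used. */
int
setup_alt_signal_stack(void)
{
	stack_t		ss;

	memset(&ss, 0, sizeof(ss));
	ss.ss_size = SIGSTKSZ;
	ss.ss_sp = malloc(ss.ss_size);
	if (ss.ss_sp == NULL)
		return -1;
	return sigaltstack(&ss, NULL);
}

sig_handler_t
sketch_pqsignal(int signo, sig_handler_t func)
{
	struct sigaction act,
				oact;

	act.sa_handler = func;

	/*
	 * Variation 2: block *all* signals while a handler runs.  The stock
	 * wrapper uses sigemptyset(&act.sa_mask), i.e. only the delivered
	 * signal itself is blocked (since SA_NODEFER is not set).
	 */
	sigfillset(&act.sa_mask);

	act.sa_flags = SA_RESTART | SA_ONSTACK;
	if (sigaction(signo, &act, &oact) < 0)
		return SIG_ERR;
	return oact.sa_handler;
}
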
> * I see special code paths for threads that were in (its notion of)
> critical sections, which must be rare, but it looks like that just
> leaves the signal pending, which seems reasonable
>
> * I see special code paths for SIGIO and SIGURG that I didn't try to
> understand, but I wonder what would happen if we s/SIGURG/SIGXCPU/

I did that as well; it does not seem to change much, either.

[...]

> Thinking of other maybe-slightly-unusual things in the signal
> processing area that have been problematic in a couple of other OSes
> (ie systems that added emulations of Linux system calls), I wondered
> about epoll and signalfd, but it doesn't have those either, so it must
> be using plain old poll() with the widely used self-pipe trick for
> latches, and that doesn't seem likely to be new or buggy code.

Yeah, it uses that code path (see the PS below for a sketch of that
pattern).


Michael

[1] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=9513e9e5a45fd1c6165c115f43f103f93e7a7faa
[2] https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=c4a81e882e607a34d0c26caf279c98398e9c1e4d
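PS: for anyone not familiar with the self-pipe trick mentioned above,
here is a minimal standalone sketch of the pattern (illustration only,
not PostgreSQL's actual waiteventset.c code): the signal handler writes
one byte into a non-blocking pipe, and the main loop poll()s the read
end and drains it. In the hanging backtrace above it is exactly that
kind of write() from inside the handler that ends up in
__pthread_enable_asynccancel() and blocks, while the interrupted read()
was in the middle of the same cancellation machinery.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int	selfpipe[2];		/* [0] = read end, [1] = write end */

/* Signal handler: wake the poll() loop by writing one byte (async-signal-safe). */
static void
wakeup_handler(int signo)
{
	int			save_errno = errno;
	char		b = 0;

	(void) signo;
	(void) write(selfpipe[1], &b, 1);
	errno = save_errno;
}

int
main(void)
{
	struct sigaction act;
	struct pollfd pfd;
	char		buf[16];

	if (pipe(selfpipe) < 0)
		return 1;
	/* non-blocking, so a full pipe can never wedge the handler */
	fcntl(selfpipe[0], F_SETFL, O_NONBLOCK);
	fcntl(selfpipe[1], F_SETFL, O_NONBLOCK);

	memset(&act, 0, sizeof(act));
	act.sa_handler = wakeup_handler;
	sigemptyset(&act.sa_mask);
	sigaction(SIGUSR1, &act, NULL);

	pfd.fd = selfpipe[0];
	pfd.events = POLLIN;

	/* Wait until the handler has written a byte into the pipe. */
	for (;;)
	{
		int			rc = poll(&pfd, 1, -1);

		if (rc < 0 && errno == EINTR)
			continue;			/* interrupted by the signal itself, retry */
		if (rc > 0 && (pfd.revents & POLLIN))
			break;
	}

	/* Drain the pipe, then go do whatever the wakeup was for. */
	while (read(selfpipe[0], buf, sizeof(buf)) > 0)
		;
	puts("woken up by signal");
	return 0;
}

(Run it and send it SIGUSR1 from another terminal to see the wakeup.)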
