10.11.2025 22:03, Thomas Munro wrote:
On Tue, Nov 11, 2025 at 8:00 AM Alexander Lakhin <[email protected]> wrote:
With this modification:
@@ -137,7 +140,7 @@ pqsignal(int signo, pqsigfunc func)
#if !(defined(WIN32) && defined(FRONTEND))
act.sa_handler = func;
- sigemptyset(&act.sa_mask);
+ sigfillset(&act.sa_mask);
act.sa_flags = SA_RESTART;
I got through 100 iterations (12 of them hung) without that Assert being
triggered.
Interesting. Perhaps a minimal program that installs a handler
containing assert(signo < 32) for both SIGUSR1 and SIGUSR2 might fail
too, if another program loops calling kill(the_other_one, rand() % 2 ==
0 ? SIGUSR1 : SIGUSR2), to support a bug report?
Yeah, thank you for the idea! I will try it in the coming days.
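A rough sketch of what such a reproducer could look like, in case it helps. This is only an assumption about its shape, not something that has been run on Hurd. The names check_handler, install_handlers and bombard are made up for illustration, and the handler records an unexpected signal number instead of calling assert() directly, so the main loop can report it:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Updated only from the handler; sig_atomic_t keeps the accesses safe. */
static volatile sig_atomic_t got_usr1 = 0;
static volatile sig_atomic_t got_usr2 = 0;
static volatile sig_atomic_t bad_signo = 0;	/* nonzero if a bogus signo was seen */

/* Hypothetical handler: anything other than SIGUSR1/SIGUSR2 is a bug. */
void
check_handler(int signo)
{
	if (signo == SIGUSR1)
		got_usr1++;
	else if (signo == SIGUSR2)
		got_usr2++;
	else
		bad_signo = (signo != 0) ? signo : -1;
}

/*
 * Install the same handler for both signals, using an empty sa_mask and
 * SA_RESTART as in the unmodified pqsignal().  Returns 0 on success.
 */
int
install_handlers(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_handler = check_handler;
	sigemptyset(&act.sa_mask);
	act.sa_flags = SA_RESTART;

	if (sigaction(SIGUSR1, &act, NULL) != 0)
		return -1;
	if (sigaction(SIGUSR2, &act, NULL) != 0)
		return -1;
	return 0;
}

/* Sender side: fire a random mix of the two signals at the target. */
void
bombard(pid_t target, int count)
{
	for (int i = 0; i < count; i++)
		kill(target, rand() % 2 == 0 ? SIGUSR1 : SIGUSR2);
}
```

The receiver would call install_handlers() and then loop checking bad_signo, while a second process calls bombard() with the receiver's pid. Replacing sigemptyset() with sigfillset() would give the variant corresponding to the patch above.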
[lots of weird errors in a wide range of code]
I can't make much sense of these failures, but are you saying that
these only happen without that sigfillset(&act.sa_mask) change, that
is, when the signal implementation is misbehaving? If so, I wonder if
the same bug in their signal handling might just be corrupting the
user stack sometimes even when the signal number assertion doesn't
trip.
No, I think those failures are unrelated. I hit them just because I
executed `make check` many times, and some of them definitely occurred
with the unmodified code. Now that I have a script that handles OS hangs
and restores the VM's disk automatically, I can run tests for hours and
look for one failure or another, if that would be helpful.
On the assumption that this isn't a general bug, but just a timing issue
(planning 'SELECT 1' isn't complicated), I see two possibilities:
1. Ignore the plan times, and replace SELECT 1 with SELECT
pg_sleep(1e-6), similar to e849bd551. I guess this would reduce test
coverage, so it is likely not great?
2. Make the query a bit more complicated so that the plan time is likely
to be non-negligible. I actually had to go quite a way to make it pretty
failsafe; the attached made it fail less than 5 times out of 50000
iterations. I am not sure whether that is acceptable or still considered
flaky?
Wait, we have tests that fail if the clock doesn't advance? Isn't
that just bogus?
Yeah, we have, this was discussed (and one test was hardened) upthread.
What concerns me is that there is also subscription.sql, and there may
be other test(s) that expect at least 1000ns (far from infinite) timer
resolution. Probably it would make sense to define which timer resolution
we consider acceptable for tests and then to check if Hurd can provide it.
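One way to check what a system actually provides might be to measure the smallest step the clock is seen to make. The sketch below is only an illustration under that assumption: it calls clock_gettime(CLOCK_MONOTONIC) directly rather than going through any PostgreSQL wrapper, and the names now_ns and smallest_observed_tick_ns are hypothetical:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC reading, in nanoseconds. */
int64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t) ts.tv_sec * 1000000000 + ts.tv_nsec;
}

/*
 * Spin until the clock value changes, "samples" times, and return the
 * smallest change seen.  This measures the observable granularity, which
 * can be coarser than what clock_getres() reports.
 */
int64_t
smallest_observed_tick_ns(int samples)
{
	int64_t		best = INT64_MAX;

	for (int i = 0; i < samples; i++)
	{
		int64_t		start = now_ns();
		int64_t		next;

		do
		{
			next = now_ns();
		} while (next == start);

		if (next - start < best)
			best = next - start;
	}
	return best;
}
```

If the value returned on Hurd is well above 1000, tests that assume the clock advances between two nearby calls cannot be expected to be stable there.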
Ah, I see, so that one is checking if the last reset time advanced to
check that something happened. That also has the theoretical problem
that CLOCK_REALTIME can go backwards sometimes, due to ntpd
adjustments or whatever. In the absence of a "reset_counter" column,
perhaps we could consider a kludge like x->reset_time =
Max(x->reset_time + 1ns, now), just to make sure the value always goes
up on reset, without having any noticeable effect on normal systems...
AFAICS, those test cases use pg_clock_gettime_ns() with CLOCK_MONOTONIC
(if defined, and it is indeed defined on Hurd), so it should not matter
in this particular case.
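For reference, the Max(x->reset_time + 1ns, now) kludge quoted above could be sketched roughly as follows. The struct and function names are hypothetical and the timestamp is taken to be a plain nanosecond counter, which is an assumption for illustration and not how the stats code actually stores it:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a stats entry with a reset timestamp. */
typedef struct StatsEntry
{
	int64_t		reset_time_ns;
} StatsEntry;

/*
 * Record a reset at time "now", but never let the stored value stay the
 * same or go backwards: it always advances by at least 1 ns.
 */
void
record_reset(StatsEntry *x, int64_t now)
{
	int64_t		floor_ns = x->reset_time_ns + 1;

	x->reset_time_ns = (now > floor_ns) ? now : floor_ns;
}
```

With a clock that has advanced normally, the stored value is simply "now". Only when the clock has not moved, or has stepped back, does the + 1 branch take effect.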
Best regards,
Alexander