On Thu, Jul 7, 2022 at 8:39 AM Andres Freund <and...@anarazel.de> wrote:
> On 2022-07-06 21:29:41 +0200, Alvaro Herrera wrote:
> > On 2022-Jul-05, Andres Freund wrote:
> >
> > > I think we'd be better off disabling at least some signals during
> > > dsm_impl_posix_resize(). I'm afraid we'll otherwise just find another
> > > variation of these problems. I haven't checked the source of ftruncate,
> > > but what Thomas dug up for fallocate makes it pretty clear that our
> > > current approach of just retrying again and again isn't good enough.
> > > It's a bit more obvious that it's a problem for fallocate, but I don't
> > > think it's worth having different solutions for the two.
> >
> > So what if we move the retry loop one level up? As in the attached.
> > Here, if we get EINTR then we retry both syscalls.
>
> Doesn't really seem to address the problem to me. posix_fallocate()
> takes some time (~1s for 3GB roughly), so if we signal at a higher rate,
> we'll just get stuck.
>
> I hacked a bit on a test program from Thomas, and it's pretty clearly
> that with a 5ms timer interval you'll pretty much not make
> progress. It's much easier to get fallocate() to get interrupted than
> ftruncate(), but the latter gets interrupted e.g. when you do a strace
> in the "wrong" moment (afaics SIGSTOP/SIGCONT trigger EINTR in
> situations that are retried otherwise).
>
> So I think we need: 1) block most signals, 2) a retry loop *without*
> interrupt checks.

Yeah. I was also wondering about wrapping the whole function in
PG_SETMASK(&BlockSig) ... PG_SETMASK(&UnBlockSig), while still keeping the
while (rc == EINTR) loop (without the checks of the *Pending variables).
The loop would stay only because, without it, attaching a debugger and
continuing gives you a spurious EINTR that interferes with program
execution. All blockable signals would be blocked *except* SIGQUIT, which
means that immediate shutdown/crash handling will still work. It seems
nice to keep that way of interrupting it without resorting to SIGKILL.
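
To make the idea concrete, here is a rough standalone sketch of the shape
I have in mind. It is not the actual patch: resize_with_signals_blocked()
is a made-up name, and it uses plain sigprocmask() with a saved mask in
place of PG_SETMASK(&BlockSig)/PG_SETMASK(&UnBlockSig). It assumes a
platform that provides posix_fallocate() (e.g. Linux).

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): set the size of fd to `size`
 * bytes and reserve the space, with every blockable signal except SIGQUIT
 * masked for the duration.  Returns 0 on success, else an errno value.
 */
int
resize_with_signals_blocked(int fd, off_t size)
{
    sigset_t    block;
    sigset_t    saved;
    int         rc;

    assert(size >= 0);

    /* Block everything that can be blocked, but leave SIGQUIT deliverable. */
    sigfillset(&block);
    sigdelset(&block, SIGQUIT);
    if (sigprocmask(SIG_SETMASK, &block, &saved) != 0)
        return errno;

    /*
     * Retry on EINTR without any interrupt checks.  With signals blocked
     * this should be rare, but a debugger attach/continue can still cause it.
     * ftruncate() reports failure via -1 and errno.
     */
    do
    {
        rc = (ftruncate(fd, size) == 0) ? 0 : errno;
    } while (rc == EINTR);

    /*
     * posix_fallocate() returns the error number directly and does not set
     * errno.  It rejects a zero length, so skip it for size == 0.
     */
    if (rc == 0 && size > 0)
    {
        do
        {
            rc = posix_fallocate(fd, 0, size);
        } while (rc == EINTR);
    }

    /* Restore whatever mask the caller had. */
    sigprocmask(SIG_SETMASK, &saved, NULL);

    return rc;
}
```

Restoring the saved mask rather than installing a fixed "unblocked" mask
is just a choice made for this standalone sketch, so that it doesn't
depend on any global state.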