Hi, On 2025-05-08 19:22:27 -0700, Noah Misch wrote: > On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote: > > On 2025-05-02 20:05:11 -0700, Noah Misch wrote: > > > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote: > > > We do need to hold interrupts in a few other places, I think - with some > > debug > > infrastructure (things like calling ProcessBarrierSmgrRelease() whenever > > interrupts could be processed and calling CFI() in errstart() in its return > > false case) it's possible to find state confusions which trigger > > assertions. The issue is that pgaio_io_update_state() contains a > > pgaio_debug_io() and executing pgaio_closing_fd() in places that call > > pgaio_io_update_state() doesn't end well. There's a similar danger with the > > debug message in pgaio_io_reclaim(). > > > > In the attached patch I added an assertion to pgaio_io_update_state() > > verifying that interrupts are held and added code to hold interrupts in the > > relevant places. > > Works for me. > > > > For the "no free IOs despite no in-flight IOs" case, I'd replace the > > > ereport(ERROR) with "return;" since we now know interrupt processing > > > reclaimed > > > an IO. > > > > Hm - it seems better to me to check if there are now free handles and return > > if that's the case, but to keep the error check in case there actually is no > > free IO? That seems like a not implausible bug... > > Works for me. > > > > Then decide what protection if any, we need against bugs causing an > > > infinite loop in caller pgaio_io_acquire(). What's the case motivating > > > the > > > unbounded loop in pgaio_io_acquire(), as opposed to capping at two > > > pgaio_io_acquire_nb() calls? If the theory is that pgaio_io_acquire() > > > could > > > be reentrant, what scenario would reach that reentrancy? > > > > I do not remember why I wrote this as an endless loop. If you prefer I > > could > > change that as part of this patch. > > I asked because it would be sad to remove the ereport(ERROR) like I proposed > and then have a bug cause a real infinite loop. Removing the loop was one way > to prove that can't happen. As you say, another way would be keeping the > ereport(ERROR) and guarding it with a free-handles check, like in your patch > today. I don't have a strong preference between those.
I pushed it this way just now. Thanks again for Alexander for reporting & investigating the issues and for Noah's review. Greetings, Andres Freund