Hi,

On 2025-05-08 19:22:27 -0700, Noah Misch wrote:
> On Thu, May 08, 2025 at 09:06:18PM -0400, Andres Freund wrote:
> > On 2025-05-02 20:05:11 -0700, Noah Misch wrote:
> > > On Wed, Apr 30, 2025 at 04:00:35PM -0400, Andres Freund wrote:
> 
> > We do need to hold interrupts in a few other places, I think - with some 
> > debug
> > infrastructure (things like calling ProcessBarrierSmgrRelease() whenever
> > interrupts could be processed and calling CFI() in errstart() in its return
> > false case) it's possible to find state confusions which trigger
> > assertions. The issue is that pgaio_io_update_state() contains a
> > pgaio_debug_io() and executing pgaio_closing_fd() in places that call
> > pgaio_io_update_state() doesn't end well.  There's a similar danger with the
> > debug message in pgaio_io_reclaim().
> > 
> > In the attached patch I added an assertion to pgaio_io_update_state()
> > verifying that interrupts are held and added code to hold interrupts in the
> > relevant places.
> 
> Works for me.
> 
> > > For the "no free IOs despite no in-flight IOs" case, I'd replace the
> > > ereport(ERROR) with "return;" since we now know interrupt processing 
> > > reclaimed
> > > an IO.
> > 
> > Hm - it seems better to me to check if there are now free handles and return
> > if that's the case, but to keep the error check in case there actually is no
> > free IO? That seems like a not implausible bug...
> 
> Works for me.
> 
> > > Then decide what protection if any, we need against bugs causing an
> > > infinite loop in caller pgaio_io_acquire().  What's the case motivating 
> > > the
> > > unbounded loop in pgaio_io_acquire(), as opposed to capping at two
> > > pgaio_io_acquire_nb() calls?  If the theory is that pgaio_io_acquire() 
> > > could
> > > be reentrant, what scenario would reach that reentrancy?
> > 
> > I do not remember why I wrote this as an endless loop.  If you prefer I 
> > could
> > change that as part of this patch.
> 
> I asked because it would be sad to remove the ereport(ERROR) like I proposed
> and then have a bug cause a real infinite loop.  Removing the loop was one way
> to prove that can't happen.  As you say, another way would be keeping the
> ereport(ERROR) and guarding it with a free-handles check, like in your patch
> today.  I don't have a strong preference between those.

I pushed it this way just now.

Thanks again for Alexander for reporting & investigating the issues and for
Noah's review.

Greetings,

Andres Freund


Reply via email to