Hi,

On 2025-03-25 13:18:50 -0700, Noah Misch wrote:
> On Tue, Mar 25, 2025 at 04:07:35PM -0400, Andres Freund wrote:
> > On 2025-03-25 12:39:56 -0700, Noah Misch wrote:
> > > On Tue, Mar 25, 2025 at 02:58:37PM -0400, Andres Freund wrote:
> > > > There are 2 1/2 ways around this:
> > > > 
> > > > 1) Stop using IOSQE_ASYNC heuristic
> > > > 2a) Wait for all in-flight IOs when any FD gets closed
> > > > 2b) Wait for all in-flight IOs using FD when it gets closed
> > > > 
> > > > Given that we have clear evidence that io_uring doesn't completely 
> > > > support
> > > > closing FDs while IOs are in flight, be it a bug or intentional, it 
> > > > seems
> > > > clearly better to go for 2a or 2b.
> > > 
> > > Agreed.  If a workload spends significant time on fd.c closing files, I
> > > suspect that workload already won't have impressive benchmark numbers.
> > > Performance-seeking workloads will already want to tune FD usage high 
> > > enough
> > > to keep FDs long-lived.  So (1) clearly loses, and neither (2a) nor (2b)
> > > clearly beats the other.  I'd try (2b) first but, if complicated, quickly
> > > abandon it in favor of (2a).  What other considerations could be 
> > > important?
> > 
> > The only other consideration I can think of is whether this should happen 
> > for
> > all io_methods or not.
> 
> Either way is fine, I think.

Here's a draft incremental patch (attached as a .fixup to avoid triggering
cfbot) implementing 2b).


> > I'm inclined to do it via a bool in IoMethodOps, but I guess one could argue
> > it's a bit weird to have a bool in a struct called *Ops.
> 
> That wouldn't bother me.  IndexAmRoutine has many bools, and "Ops" is
> basically a synonym of "Routine".

Cool. Done that way.

The repeated-iteration approach taken in pgaio_closing_fd() isn't the
prettiest, but it's hard to to imagine that ever being a noticeable.


This survives a testrun where I use your torture patch and where I force all
IOs to use ASYNC. Previously that did not get very far.  I also did verify
that, if I allow a small number of FDs, we do not wrongly wait for all IOs.

Greetings,

Andres Freund
commit 290de30c0e4a87ed3bd6cdef478376d39a922912
Author: Andres Freund <and...@anarazel.de>
Date:   2025-03-25 16:50:27 -0400

    wip: aio: io_uring wait for in-flight IOs
    
    Author:
    Reviewed-by:
    Discussion: https://postgr.es/m/
    Backpatch:

diff --git a/src/include/storage/aio_internal.h 
b/src/include/storage/aio_internal.h
index a46d4393ebc..352610f60a8 100644
--- a/src/include/storage/aio_internal.h
+++ b/src/include/storage/aio_internal.h
@@ -287,6 +287,15 @@ typedef struct PgAioCtl
  */
 typedef struct IoMethodOps
 {
+       /* properties */
+
+       /*
+        * If an FD is about to be closed, do we need to wait for all in-flight
+        * IOs referencing that FD?
+        */
+       bool            wait_on_fd_before_close;
+
+
        /* global initialization */
 
        /*
@@ -368,6 +377,7 @@ extern PgAioResult pgaio_io_call_complete_local(PgAioHandle 
*ioh);
 /* aio_io.c */
 extern void pgaio_io_perform_synchronously(PgAioHandle *ioh);
 extern const char *pgaio_io_get_op_name(PgAioHandle *ioh);
+extern bool pgaio_io_uses_fd(PgAioHandle *ioh, int fd);
 
 /* aio_target.c */
 extern bool pgaio_io_can_reopen(PgAioHandle *ioh);
diff --git a/src/backend/storage/aio/aio.c b/src/backend/storage/aio/aio.c
index c512495dce3..8816ccfa0d7 100644
--- a/src/backend/storage/aio/aio.c
+++ b/src/backend/storage/aio/aio.c
@@ -1303,6 +1303,41 @@ pgaio_closing_fd(int fd)
         * it's probably not worth it.
         */
        pgaio_submit_staged();
+
+       /*
+        * If requested by the IO method, wait for all IOs that use the
+        * to-be-closed FD.
+        */
+       if (pgaio_method_ops->wait_on_fd_before_close)
+       {
+               /*
+                * As waiting for one IO to complete may complete multiple IOs, 
we
+                * can't just use a mutable list iterator. The maximum number of
+                * in-flight IOs is fairly small, so just restart the loop after
+                * waiting for an IO.
+                */
+               while (!dclist_is_empty(&pgaio_my_backend->in_flight_ios))
+               {
+                       dlist_iter      iter;
+                       PgAioHandle *ioh = NULL;
+
+                       dclist_foreach(iter, &pgaio_my_backend->in_flight_ios)
+                       {
+                               ioh = dclist_container(PgAioHandle, node, 
iter.cur);
+
+                               if (pgaio_io_uses_fd(ioh, fd))
+                                       break;
+                               else
+                                       ioh = NULL;
+                       }
+
+                       if (!ioh)
+                               break;
+
+                       /* see comment in pgaio_io_wait_for_free() about 
raciness */
+                       pgaio_io_wait(ioh, ioh->generation);
+               }
+       }
 }
 
 /*
diff --git a/src/backend/storage/aio/aio_io.c b/src/backend/storage/aio/aio_io.c
index 97930d687c2..108021fa377 100644
--- a/src/backend/storage/aio/aio_io.c
+++ b/src/backend/storage/aio/aio_io.c
@@ -186,3 +186,25 @@ pgaio_io_get_op_name(PgAioHandle *ioh)
 
        return NULL;                            /* silence compiler */
 }
+
+/*
+ * Used to determine if an IO needs to be waited upon before the file
+ * descriptor can be closed.
+ */
+bool
+pgaio_io_uses_fd(PgAioHandle *ioh, int fd)
+{
+       Assert(ioh->state >= PGAIO_HS_DEFINED);
+
+       switch (ioh->op)
+       {
+               case PGAIO_OP_READV:
+                       return ioh->op_data.read.fd == fd;
+               case PGAIO_OP_WRITEV:
+                       return ioh->op_data.write.fd == fd;
+               case PGAIO_OP_INVALID:
+                       return false;
+       }
+
+       return false;                           /* silence compiler */
+}
diff --git a/src/backend/storage/aio/method_io_uring.c 
b/src/backend/storage/aio/method_io_uring.c
index aab09793684..3b299dcf388 100644
--- a/src/backend/storage/aio/method_io_uring.c
+++ b/src/backend/storage/aio/method_io_uring.c
@@ -58,6 +58,15 @@ static void pgaio_uring_sq_from_io(PgAioHandle *ioh, struct 
io_uring_sqe *sqe);
 
 
 const IoMethodOps pgaio_uring_ops = {
+       /*
+        * While io_uring mostly is OK with FDs getting closed while the IO is 
in
+        * flight, that is not true for IOs submitted with IOSQE_ASYNC.
+        *
+        * See
+        * 
https://postgr.es/m/5ons2rtmwarqqhhexb3dnqulw5rjgwgoct57vpdau4rujlrffj%403fls6d2mkiwc
+        */
+       .wait_on_fd_before_close = true,
+
        .shmem_size = pgaio_uring_shmem_size,
        .shmem_init = pgaio_uring_shmem_init,
        .init_backend = pgaio_uring_init_backend,

Reply via email to