Hello Thomas,

12.10.2025 06:35, Thomas Munro wrote:
On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <[email protected]> wrote:
2025-10-11 11:34:46.793 GMT [1169773:1] PANIC:  !!!pgaio_io_wait| ioh->state changed from 0 to 1 at iteration 0
# no other iteration number observed
Can you please disassemble pgaio_io_update_state() and
pgaio_io_was_recycled()?  I wonder if the memory barriers are not
being generated correctly, causing the state and generation to be
loaded out of order, or something like that...
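
For readers following along, the ordering in question can be sketched in portable C11. This is only an illustration under assumptions: DemoHandle, the demo_* functions and the <stdatomic.h> fences are hypothetical stand-ins, not PostgreSQL's actual pg_write_barrier()/pg_read_barrier() implementation. The writer bumps the generation and publishes the new state behind a release fence; the reader loads the state, issues an acquire fence, and only then loads the generation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical states; the numeric values are for illustration only. */
enum { DEMO_IDLE = 0, DEMO_SUBMITTED = 1 };

/* Hypothetical, heavily simplified stand-in for an IO handle. */
typedef struct DemoHandle
{
    _Atomic uint8_t  state;
    _Atomic uint64_t generation;
} DemoHandle;

void
demo_handle_init(DemoHandle *h, uint8_t state, uint64_t generation)
{
    atomic_init(&h->state, state);
    atomic_init(&h->generation, generation);
}

/* Writer side: bump the generation first, then publish the IDLE state. */
void
demo_reclaim(DemoHandle *h)
{
    assert(atomic_load_explicit(&h->state, memory_order_relaxed) != DEMO_IDLE);

    atomic_fetch_add_explicit(&h->generation, 1, memory_order_relaxed);

    /* Assumed analogue of a write barrier: earlier stores become visible first. */
    atomic_thread_fence(memory_order_release);

    atomic_store_explicit(&h->state, DEMO_IDLE, memory_order_relaxed);
}

/* Reader side: load the state, then (after a fence) the generation. */
bool
demo_was_recycled(DemoHandle *h, uint64_t ref_generation, uint8_t *state)
{
    *state = atomic_load_explicit(&h->state, memory_order_relaxed);

    /* Assumed analogue of a read barrier: the later load cannot move up. */
    atomic_thread_fence(memory_order_acquire);

    return atomic_load_explicit(&h->generation, memory_order_relaxed) != ref_generation;
}
```

If the two loads in demo_was_recycled() could be reordered, a waiter might pair a new state with the old generation, which is roughly the symptom being discussed here.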

Please find those attached. gdb's "disass/m pgaio_io_update_state" does not
show the start of the function at the top of its output (it is still
disassembled further down), so I decided to share the whole output.
This is from clean master, without any modifications, and I have confirmed
that the issue reproduces with this build:
2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG:  connection received: host=[local]
2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG:  connection received: host=[local]
...
2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_schema
2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_type
2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_procedure
2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_misc
2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_table
2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG:  connection authorized: user=debian database=regression application_name=pg_regress/create_function_c
2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL:  IO in wrong state: 0
2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE:
        postgres: debian regression [local] authentication(+0x4371ec) [0x555565cd61ec]
        postgres: debian regression [local] authentication(+0x442bb2) [0x555565ce1bb2]
        postgres: debian regression [local] authentication(StartBufferIO+0x66) [0x555565ce1a10]
        postgres: debian regression [local] authentication(+0x43df60) [0x555565cdcf60]
        postgres: debian regression [local] authentication(StartReadBuffer+0x308) [0x555565cdc8a4]
        postgres: debian regression [local] authentication(ReadBufferExtended+0x64) [0x555565cdaace]
...
        postgres: debian regression [local] authentication(postmaster_child_launch+0x132) [0x555565c7787c]
        postgres: debian regression [local] authentication(+0x3dcd52) [0x555565c7bd52]
        postgres: debian regression [local] authentication(InitProcessGlobals+0) [0x555565c79c96]
        postgres: debian regression [local] authentication(+0x3194b2) [0x555565bb84b2]
        /lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c]
        /lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffa69089c4]
        postgres: debian regression [local] authentication(_start+0x20) [0x5555659842c8]

The previous failure on greenfly was a TIMEOUT in the same test, as if
a query was hanging.

Yeah, I'll try to reproduce it too...

On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <[email protected]> wrote:
I've managed to reproduce it using qemu-system-riscv64 with Debian trixie
Huh, that's interesting.  What is the host architecture?  When I saw
that error myself and wondered about memory order, I dismissed the
idea of trying with qemu, figuring that my x86 host's TSO would affect
the coherency, but thinking again about that... I guess the compiler
might still reorder during riscv codegen if there is something wrong
with the barrier support, and even if it doesn't, the binary
translation to x86 might also feel free to reorder stuff if there are
no barrier instructions to prevent it?  Or maybe that doesn't happen
but your host is ARM?
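
To make the memory-order question concrete, here is a minimal message-passing litmus sketch in C11 with pthreads. It is a hypothetical standalone test, not PostgreSQL code: the demo_* names and the choice of atomic_thread_fence() are assumptions made for illustration. A writer stores data and then a flag, a reader loads the flag and then data, and the test counts how often the reader sees a flag value newer than the data it goes on to read. With both fences in place the C11 memory model requires that count to be zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t demo_data;
static _Atomic uint64_t demo_flag;
static uint64_t demo_rounds;

/* Writer: store data, then publish the same value through the flag. */
static void *
demo_writer(void *arg)
{
    (void) arg;
    for (uint64_t i = 1; i <= demo_rounds; i++)
    {
        atomic_store_explicit(&demo_data, i, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);  /* write-barrier analogue */
        atomic_store_explicit(&demo_flag, i, memory_order_relaxed);
    }
    return NULL;
}

/* Reader: load flag, then data; data older than the flag is a violation. */
static void *
demo_reader(void *arg)
{
    uint64_t   *violations = arg;
    uint64_t    f;

    do
    {
        f = atomic_load_explicit(&demo_flag, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);  /* read-barrier analogue */
        if (atomic_load_explicit(&demo_data, memory_order_relaxed) < f)
            (*violations)++;
    } while (f < demo_rounds);
    return NULL;
}

/* Runs one writer and one reader; returns UINT64_MAX if a thread can't start. */
uint64_t
demo_run_litmus(uint64_t rounds)
{
    pthread_t   writer;
    pthread_t   reader;
    uint64_t    violations = 0;

    assert(rounds > 0);
    demo_rounds = rounds;
    atomic_store(&demo_data, 0);
    atomic_store(&demo_flag, 0);

    if (pthread_create(&writer, NULL, demo_writer, NULL) != 0)
        return UINT64_MAX;
    if (pthread_create(&reader, NULL, demo_reader, &violations) != 0)
    {
        pthread_join(writer, NULL);
        return UINT64_MAX;
    }
    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    return violations;
}
```

Calling demo_run_litmus() with a large round count and the fences intact should return 0 on any conforming platform. Deleting the fences and rerunning on the target in question is one way to probe whether reordering is observable there, although a clean run proves nothing.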

I use an AMD Ryzen 9 7900X (x86-64) host with Ubuntu 24.04 and run the RISC-V machine with:
qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 \
  -device virtio-blk-device,drive=hd \
  -drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd \
  -device virtio-net-device,netdev=net \
  -netdev user,id=net,hostfwd=tcp::22222-:22 \
  -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
  -kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf \
  -object rng-random,filename=/dev/urandom,id=rng \
  -device virtio-rng-device,rng=rng -nographic \
  -append "root=LABEL=rootfs console=ttyS0"

I don't know what hardware greenfly uses (CC'ing Greg in case he'd like to
share some info on this), but timings are similar to what I'm seeing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check
ok 160       - select_parallel                          5019 ms

`make check` on mine, with select_parallel repeated:
ok 160       - select_parallel                          6042 ms
ok 161       - select_parallel                          6089 ms
ok 162       - select_parallel                          6116 ms

copperhead and boomslang, which are using real RISC-V hardware [1], show
better timings:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check
ok 160       - select_parallel                          2677 ms

https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check
ok 160       - select_parallel                          3651 ms

Thank you for looking into this!

[1] https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com

Best regards,
Alexander
GNU gdb (Debian 15.2-1) 15.2
This GDB was configured as "riscv64-linux-gnu".
Reading symbols from tmp_install/usr/local/pgsql/bin/postgres...
(gdb) Dump of assembler code for function pgaio_io_update_state:
341             Assert(ioh >= pgaio_ctl->io_handles &&
   0x0000000000436490 <+90>:    auipc   a0,0x4c3
   0x0000000000436494 <+94>:    ld      a1,-616(a0) # 0x8f9228 <pgaio_ctl>
   0x0000000000436498 <+98>:    ld      a0,48(a1)
   0x000000000043649a <+100>:   bltu    s1,a0,0x436552 <pgaio_io_update_state+284>
   0x000000000043649e <+104>:   lwu     a1,40(a1)
   0x00000000004364a2 <+108>:   li      a2,144
   0x00000000004364a6 <+112>:   mul     a1,a1,a2
   0x00000000004364aa <+116>:   add     a1,a1,a0
   0x00000000004364ac <+118>:   bgeu    s1,a1,0x436552 <pgaio_io_update_state+284>
   0x00000000004364b0 <+122>:   auipc   a1,0x4c1
   0x00000000004364b4 <+126>:   ld      s3,280(a1) # 0x8f75c8
   0x0000000000436552 <+284>:   auipc   a0,0x242
   0x0000000000436556 <+288>:   addi    a0,a0,549 # 0x678777
   0x000000000043655a <+292>:   auipc   a1,0x242
   0x000000000043655e <+296>:   addi    a1,a1,163 # 0x6785fd
   0x0000000000436562 <+300>:   li      a2,342
   0x0000000000436566 <+304>:   auipc   ra,0x160
   0x000000000043656a <+308>:   jalr    994(ra) # 0x596948 <ExceptionalCondition>

342                        ioh < (pgaio_ctl->io_handles + pgaio_ctl->io_handle_count));
343             return ioh - pgaio_ctl->io_handles;
   0x00000000004364b8 <+130>:   sub     s0,s1,a0
   0x00000000004364bc <+134>:   srai    s0,s0,0x4

344     }
345     
346     /*
347      * Return the ProcNumber for the process that can use an IO handle. The
348      * mapping from IO handles to PGPROCs is static, therefore this even works
349      * when the corresponding PGPROC is not in use.
350      */
351     ProcNumber
352     pgaio_io_get_owner(PgAioHandle *ioh)
353     {
354             return ioh->owner_procno;
355     }
356     
357     /*
358      * Return a wait reference for the IO. Only wait references can be used to
359      * wait for an IOs completion, as handles themselves can be reused after
360      * completion.  See also the comment above pgaio_io_acquire().
361      */
362     void
363     pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow)
364     {
365             Assert(ioh->state == PGAIO_HS_HANDED_OUT ||
366                        ioh->state == PGAIO_HS_DEFINED ||
367                        ioh->state == PGAIO_HS_STAGED);
368             Assert(ioh->generation != 0);
369     
370             iow->aio_index = ioh - pgaio_ctl->io_handles;
371             iow->generation_upper = (uint32) (ioh->generation >> 32);
372             iow->generation_lower = (uint32) ioh->generation;
373     }
374     
375     
376     
377     /* --------------------------------------------------------------------------------
378      * Internal Functions related to PgAioHandle
379      * --------------------------------------------------------------------------------
380      */
381     
382     static inline void
383     pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
384     {
   0x0000000000436436 <+0>:     addi    sp,sp,-48
   0x0000000000436438 <+2>:     sd      ra,40(sp)
   0x000000000043643a <+4>:     sd      s0,32(sp)
   0x000000000043643c <+6>:     sd      s1,24(sp)
   0x000000000043643e <+8>:     sd      s2,16(sp)
   0x0000000000436440 <+10>:    sd      s3,8(sp)
   0x0000000000436442 <+12>:    sd      s4,0(sp)

385             /*
386              * All callers need to have held interrupts in some form, otherwise
387              * interrupt processing could wait for the IO to complete, while in an
388              * intermediary state.
389              */
390             Assert(!INTERRUPTS_CAN_BE_PROCESSED());
   0x0000000000436444 <+14>:    auipc   a2,0x4a6
   0x0000000000436448 <+18>:    ld      a2,564(a2) # 0x8dc678
   0x000000000043644c <+22>:    lw      a2,0(a2)
   0x000000000043644e <+24>:    mv      s4,a1
   0x0000000000436450 <+26>:    mv      s1,a0
   0x0000000000436452 <+28>:    bnez    a2,0x43646e <pgaio_io_update_state+56>
   0x0000000000436454 <+30>:    auipc   a0,0x4a6
   0x0000000000436458 <+34>:    ld      a0,508(a0) # 0x8dc650
   0x000000000043645c <+38>:    lw      a0,0(a0)
   0x000000000043645e <+40>:    bnez    a0,0x43646e <pgaio_io_update_state+56>
   0x0000000000436460 <+42>:    auipc   a0,0x4a6
   0x0000000000436464 <+46>:    ld      a0,-1616(a0) # 0x8dbe10
   0x0000000000436468 <+50>:    lw      a0,0(a0)
   0x000000000043646a <+52>:    beqz    a0,0x43656e <pgaio_io_update_state+312>
   0x000000000043656e <+312>:   auipc   a0,0x242
   0x0000000000436572 <+316>:   addi    a0,a0,1670 # 0x678bf4
   0x0000000000436576 <+320>:   auipc   a1,0x242
   0x000000000043657a <+324>:   addi    a1,a1,135 # 0x6785fd
   0x000000000043657e <+328>:   li      a2,390
   0x0000000000436582 <+332>:   auipc   ra,0x160
   0x0000000000436586 <+336>:   jalr    966(ra) # 0x596948 <ExceptionalCondition>

391     
392             pgaio_debug_io(DEBUG5, ioh,
   0x000000000043646e <+56>:    li      a0,10
   0x0000000000436470 <+58>:    li      a1,0
   0x0000000000436472 <+60>:    auipc   ra,0x161
   0x0000000000436476 <+64>:    jalr    -1200(ra) # 0x596fc2 <errstart>
   0x000000000043647a <+68>:    beqz    a0,0x43653a <pgaio_io_update_state+260>
   0x000000000043647c <+70>:    li      a0,1
   0x000000000043647e <+72>:    auipc   ra,0x163
   0x0000000000436482 <+76>:    jalr    1264(ra) # 0x59996e <errhidestmt>
   0x0000000000436486 <+80>:    li      a0,1
   0x0000000000436488 <+82>:    auipc   ra,0x163
   0x000000000043648c <+86>:    jalr    1356(ra) # 0x5999d4 <errhidecontext>
   0x00000000004364be <+136>:   mv      a0,s1
   0x00000000004364c0 <+138>:   jal     0x438f90 <pgaio_io_get_op_name>
   0x00000000004364c4 <+142>:   mv      s2,a0
   0x00000000004364c6 <+144>:   mv      a0,s1
   0x00000000004364c8 <+146>:   jal     0x439076 <pgaio_io_get_target_name>
   0x00000000004364d2 <+156>:   mv      a3,a0
   0x00000000004364e6 <+176>:   mulw    a1,s0,s3
   0x0000000000436502 <+204>:   mulw    a1,s0,s3
   0x000000000043650c <+214>:   auipc   a0,0x242
   0x0000000000436510 <+218>:   addi    a0,a0,1799 # 0x678c13
   0x0000000000436514 <+222>:   mv      a2,s2
   0x0000000000436516 <+224>:   auipc   ra,0x161
   0x000000000043651a <+228>:   jalr    -52(ra) # 0x5974e2 <errmsg_internal>
   0x000000000043651e <+232>:   auipc   a0,0x242
   0x0000000000436522 <+236>:   addi    a0,a0,223 # 0x6785fd
   0x0000000000436526 <+240>:   auipc   a1,0x242
   0x000000000043652a <+244>:   addi    a2,a1,1836 # 0x678c52
   0x000000000043652e <+248>:   li      a1,394
   0x0000000000436532 <+252>:   auipc   ra,0x161
   0x0000000000436536 <+256>:   jalr    -694(ra) # 0x59727c <errfinish>

393                                        "updating state to %s",
394                                        pgaio_io_state_get_name(new_state));
395     
396             /*
397              * Ensure the changes signified by the new state are visible before the
398              * new state becomes visible.
399              */
400             pg_write_barrier();
   0x000000000043653a <+260>:   fence   rw,w

401     
402             ioh->state = new_state;
   0x000000000043653e <+264>:   sb      s4,0(s1)
   0x0000000000436542 <+268>:   ld      ra,40(sp)
   0x0000000000436544 <+270>:   ld      s0,32(sp)
   0x0000000000436546 <+272>:   ld      s1,24(sp)
   0x0000000000436548 <+274>:   ld      s2,16(sp)
   0x000000000043654a <+276>:   ld      s3,8(sp)
   0x000000000043654c <+278>:   ld      s4,0(sp)

403     }
   0x000000000043654e <+280>:   addi    sp,sp,48
   0x0000000000436550 <+282>:   ret

404     
405     static void
406     pgaio_io_resowner_register(PgAioHandle *ioh)
407     {
408             Assert(!ioh->resowner);
409             Assert(CurrentResourceOwner);
410     
411             ResourceOwnerRememberAioHandle(CurrentResourceOwner, &ioh->resowner_node);
412             ioh->resowner = CurrentResourceOwner;
413     }
414     
415     /*
416      * Stage IO for execution and, if appropriate, submit it immediately.
417      *
418      * Should only be called from pgaio_io_start_*().
419      */
420     void
421     pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
422     {
423             bool            needs_synchronous;
424     
425             Assert(ioh->state == PGAIO_HS_HANDED_OUT);
426             Assert(pgaio_my_backend->handed_out_io == ioh);
427             Assert(pgaio_io_has_target(ioh));
428     
429             /*
430              * Otherwise an interrupt, in the middle of staging and possibly executing
431              * the IO, could end up trying to wait for the IO, leading to state
432              * confusion.
433              */
434             HOLD_INTERRUPTS();
435     
436             ioh->op = op;
437             ioh->result = 0;
438     
439             pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
440     
441             /* allow a new IO to be staged */
442             pgaio_my_backend->handed_out_io = NULL;
443     
444             pgaio_io_call_stage(ioh);
445     
446             pgaio_io_update_state(ioh, PGAIO_HS_STAGED);
447     
448             /*
449              * Synchronous execution has to be executed, well, synchronously, so check
450              * that first.
451              */
452             needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
453     
454             pgaio_debug_io(DEBUG3, ioh,
455                                        "staged (synchronous: %d, in_batch: %d)",
456                                        needs_synchronous, pgaio_my_backend->in_batchmode);
457     
458             if (!needs_synchronous)
459             {
460                     pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
461                     Assert(pgaio_my_backend->num_staged_ios <= PGAIO_SUBMIT_BATCH_SIZE);
462     
463                     /*
464                      * Unless code explicitly opted into batching IOs, submit the IO
465                      * immediately.
466                      */
467                     if (!pgaio_my_backend->in_batchmode)
468                             pgaio_submit_staged();
469             }
470             else
471             {
472                     pgaio_io_prepare_submit(ioh);
473                     pgaio_io_perform_synchronously(ioh);
474             }
475     
476             RESUME_INTERRUPTS();
477     }
478     
479     bool
480     pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
481     {
482             /*
483              * If the caller said to execute the IO synchronously, do so.
484              *
485              * XXX: We could optimize the logic when to execute synchronously by first
486              * checking if there are other IOs in flight and only synchronously
487              * executing if not. Unclear whether that'll be sufficiently common to be
488              * worth worrying about.
489              */
490             if (ioh->flags & PGAIO_HF_SYNCHRONOUS)
491                     return true;
492     
493             /* Check if the IO method requires synchronous execution of IO */
494             if (pgaio_method_ops->needs_synchronous_execution)
495                     return pgaio_method_ops->needs_synchronous_execution(ioh);
496     
497             return false;
498     }
499     
500     /*
501      * Handle IO being processed by IO method.
502      *
503      * Should be called by IO methods / synchronous IO execution, just before the
504      * IO is performed.
505      */
506     void
507     pgaio_io_prepare_submit(PgAioHandle *ioh)
508     {
509             pgaio_io_update_state(ioh, PGAIO_HS_SUBMITTED);
510     
511             dclist_push_tail(&pgaio_my_backend->in_flight_ios, &ioh->node);
512     }
513     
514     /*
515      * Handle IO getting completed by a method.
516      *
517      * Should be called by IO methods / synchronous IO execution, just after the
518      * IO has been performed.
519      *
520      * Expects to be called in a critical section. We expect IOs to be usable for
521      * WAL etc, which requires being able to execute completion callbacks in a
522      * critical section.
523      */
524     void
525     pgaio_io_process_completion(PgAioHandle *ioh, int result)
526     {
527             Assert(ioh->state == PGAIO_HS_SUBMITTED);
528     
529             Assert(CritSectionCount > 0);
530     
531             ioh->result = result;
532     
533             pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
534     
535             INJECTION_POINT("aio-process-completion-before-shared", ioh);
536     
537             pgaio_io_call_complete_shared(ioh);
538     
539             pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
540     
541             /* condition variable broadcast ensures state is visible before wakeup */
542             ConditionVariableBroadcast(&ioh->cv);
543     
544             /* contains call to pgaio_io_call_complete_local() */
545             if (ioh->owner_procno == MyProcNumber)
546                     pgaio_io_reclaim(ioh);
547     }
548     
549     /*
550      * Has the IO completed and thus the IO handle been reused?
551      *
552      * This is useful when waiting for IO completion at a low level (e.g. in an IO
553      * method's ->wait_one() callback).
554      */
555     bool
556     pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
557     {
558             *state = ioh->state;
559     
560             /*
561              * Ensure that we don't see an earlier state of the handle than ioh->state
562              * due to compiler or CPU reordering. This protects both ->generation as
563              * directly used here, and other fields in the handle accessed in the
564              * caller if the handle was not reused.
565              */
566             pg_read_barrier();
567     
568             return ioh->generation != ref_generation;
569     }
570     
571     /*
572      * Wait for IO to complete. External code should never use this, outside of
573      * the AIO subsystem waits are only allowed via pgaio_wref_wait().
574      */
575     static void
576     pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
577     {
578             PgAioHandleState state;
579             bool            am_owner;
580     
581             am_owner = ioh->owner_procno == MyProcNumber;
582     
583             if (pgaio_io_was_recycled(ioh, ref_generation, &state))
584                     return;
585     
586             if (am_owner)
587             {
588                     if (state != PGAIO_HS_SUBMITTED
589                             && state != PGAIO_HS_COMPLETED_IO
590                             && state != PGAIO_HS_COMPLETED_SHARED
591                             && state != PGAIO_HS_COMPLETED_LOCAL)
592                     {
593                             elog(PANIC, "waiting for own IO %d in wrong state: %s",
594                                      pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh));
595                     }
596             }
597     
598             while (true)
599             {
600                     if (pgaio_io_was_recycled(ioh, ref_generation, &state))
601                             return;
602     
603                     switch ((PgAioHandleState) state)
604                     {
605                             case PGAIO_HS_IDLE:
606                             case PGAIO_HS_HANDED_OUT:
607                                     elog(ERROR, "IO in wrong state: %d", state);
608                                     break;
609     
610                             case PGAIO_HS_SUBMITTED:
611     
612                                     /*
613                                      * If we need to wait via the IO method, do so now. Don't
614                                      * check via the IO method if the issuing backend is executing
615                                      * the IO synchronously.
616                                      */
617                                     if (pgaio_method_ops->wait_one && !(ioh->flags & PGAIO_HF_SYNCHRONOUS))
618                                     {
619                                             pgaio_method_ops->wait_one(ioh, ref_generation);
620                                             continue;
621                                     }
622                                     /* fallthrough */
623     
624                                     /* waiting for owner to submit */
625                             case PGAIO_HS_DEFINED:
626                             case PGAIO_HS_STAGED:
627                                     /* waiting for reaper to complete */
628                                     /* fallthrough */
629                             case PGAIO_HS_COMPLETED_IO:
630                                     /* shouldn't be able to hit this otherwise */
631                                     Assert(IsUnderPostmaster);
632                                     /* ensure we're going to get woken up */
633                                     ConditionVariablePrepareToSleep(&ioh->cv);
634     
635                                     while (!pgaio_io_was_recycled(ioh, ref_generation, &state))
636                                     {
637                                             if (state == PGAIO_HS_COMPLETED_SHARED ||
638                                                     state == PGAIO_HS_COMPLETED_LOCAL)
639                                                     break;
640                                             ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_IO_COMPLETION);
641                                     }
642     
643                                     ConditionVariableCancelSleep();
644                                     break;
645     
646                             case PGAIO_HS_COMPLETED_SHARED:
647                             case PGAIO_HS_COMPLETED_LOCAL:
648     
649                                     /*
650                                      * Note that no interrupts are processed between
651                                      * pgaio_io_was_recycled() and this check - that's important
652                                      * as otherwise an interrupt could have already reclaimed the
653                                      * handle.
654                                      */
655                                     if (am_owner)
656                                             pgaio_io_reclaim(ioh);
657                                     return;
658                     }
659             }
660     }
661     
662     /*
663      * Make IO handle ready to be reused after IO has completed or after the
664      * handle has been released without being used.
665      *
666      * Note that callers need to be careful about only calling this in the right
667      * state and that no interrupts can be processed between the state check and
668      * the call to pgaio_io_reclaim(). Otherwise interrupt processing could
669      * already have reclaimed the handle.
670      */
671     static void
672     pgaio_io_reclaim(PgAioHandle *ioh)
673     {
674             /* This is only ok if it's our IO */
675             Assert(ioh->owner_procno == MyProcNumber);
676             Assert(ioh->state != PGAIO_HS_IDLE);
677     
678             /* see comment in function header */
679             HOLD_INTERRUPTS();
680     
681             /*
682              * It's a bit ugly, but right now the easiest place to put the execution
683              * of local completion callbacks is this function, as we need to execute
684              * local callbacks just before reclaiming at multiple callsites.
685              */
686             if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
687             {
688                     PgAioResult local_result;
689     
690                     local_result = pgaio_io_call_complete_local(ioh);
691                     pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_LOCAL);
692     
693                     if (ioh->report_return)
694                     {
695                             ioh->report_return->result = local_result;
696                             ioh->report_return->target_data = ioh->target_data;
697                     }
698             }
699     
700             pgaio_debug_io(DEBUG4, ioh,
701                                        "reclaiming: distilled_result: (status %s, id %u, error_data %d), raw_result: %d",
702                                        pgaio_result_status_string(ioh->distilled_result.status),
703                                        ioh->distilled_result.id,
704                                        ioh->distilled_result.error_data,
705                                        ioh->result);
706     
707             /* if the IO has been defined, it's on the in-flight list, remove */
708             if (ioh->state != PGAIO_HS_HANDED_OUT)
709                     dclist_delete_from(&pgaio_my_backend->in_flight_ios, &ioh->node);
710     
711             if (ioh->resowner)
712             {
713                     ResourceOwnerForgetAioHandle(ioh->resowner, &ioh->resowner_node);
714                     ioh->resowner = NULL;
715             }
716     
717             Assert(!ioh->resowner);
718     
719             /*
720              * Update generation & state first, before resetting the IO's fields,
721              * otherwise a concurrent "viewer" could think the fields are valid, even
722              * though they are being reset.  Increment the generation first, so that
723              * we can assert elsewhere that we never wait for an IDLE IO.  While it's
724              * a bit weird for the state to go backwards for a generation, it's OK
725              * here, as there cannot be references to the "reborn" IO yet.  Can't
726              * update both at once, so something has to give.
727              */
728             ioh->generation++;
729             pgaio_io_update_state(ioh, PGAIO_HS_IDLE);
730     
731             /* ensure the state update is visible before we reset fields */
732             pg_write_barrier();
733     
734             ioh->op = PGAIO_OP_INVALID;
735             ioh->target = PGAIO_TID_INVALID;
736             ioh->flags = 0;
737             ioh->num_callbacks = 0;
738             ioh->handle_data_len = 0;
739             ioh->report_return = NULL;
740             ioh->result = 0;
741             ioh->distilled_result.status = PGAIO_RS_UNKNOWN;
742     
743             /*
744              * We push the IO to the head of the idle IO list, that seems 
more cache
745              * efficient in cases where only a few IOs are used.
746              */
747             dclist_push_head(&pgaio_my_backend->idle_ios, &ioh->node);
748     
749             RESUME_INTERRUPTS();
750     }
751     
752     /*
753      * Wait for an IO handle to become usable.
754      *
755      * This only really is useful for pgaio_io_acquire().
756      */
757     static void
758     pgaio_io_wait_for_free(void)
759     {
760             int                     reclaimed = 0;
761     
762             pgaio_debug(DEBUG2, "waiting for free IO with %d pending, %u in-flight, %u idle IOs",
763                                     pgaio_my_backend->num_staged_ios,
764                                     dclist_count(&pgaio_my_backend->in_flight_ios),
765                                     dclist_count(&pgaio_my_backend->idle_ios));
766     
767             /*
768              * First check if any of our IOs actually have completed - when using
769              * worker, that'll often be the case. We could do so as part of the loop
770              * below, but that'd potentially lead us to wait for some IO submitted
771              * before.
772              */
773             for (int i = 0; i < io_max_concurrency; i++)
774             {
775                     PgAioHandle *ioh = &pgaio_ctl->io_handles[pgaio_my_backend->io_handle_off + i];
776     
777                     if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
778                     {
779                             /*
780                              * Note that no interrupts are processed between the state check
781                              * and the call to reclaim - that's important as otherwise an
782                              * interrupt could have already reclaimed the handle.
783                              *
784                              * Need to ensure that there's no reordering, in the more common
785                              * paths, where we wait for IO, that's done by
786                              * pgaio_io_was_recycled().
787                              */
788                             pg_read_barrier();
789                             pgaio_io_reclaim(ioh);
790                             reclaimed++;
791                     }
792             }
793     
794             if (reclaimed > 0)
795                     return;
796     
797             /*
798              * If we have any unsubmitted IOs, submit them now. We'll start waiting in
799              * a second, so it's better they're in flight. This also addresses the
800              * edge-case that all IOs are unsubmitted.
801              */
802             if (pgaio_my_backend->num_staged_ios > 0)
803                     pgaio_submit_staged();
804     
805             /* possibly some IOs finished during submission */
806             if (!dclist_is_empty(&pgaio_my_backend->idle_ios))
807                     return;
808     
809             if (dclist_count(&pgaio_my_backend->in_flight_ios) == 0)
810                     ereport(ERROR,
811                                     errmsg_internal("no free IOs despite no in-flight IOs"),
812                                     errdetail_internal("%d pending, %u in-flight, %u idle IOs",
813                                                                        pgaio_my_backend->num_staged_ios,
814                                                                        dclist_count(&pgaio_my_backend->in_flight_ios),
815                                                                        dclist_count(&pgaio_my_backend->idle_ios)));
816     
817             /*
818              * Wait for the oldest in-flight IO to complete.
819              *
820              * XXX: Reusing the general IO wait is suboptimal, we don't need to wait
821              * for that specific IO to complete, we just need *any* IO to complete.
822              */
823             {
824                     PgAioHandle *ioh = dclist_head_element(PgAioHandle, node,
825                                                                                                &pgaio_my_backend->in_flight_ios);
826                     uint64          generation = ioh->generation;
827     
828                     switch ((PgAioHandleState) ioh->state)
829                     {
830                                     /* should not be in in-flight list */
831                             case PGAIO_HS_IDLE:
832                             case PGAIO_HS_DEFINED:
833                             case PGAIO_HS_HANDED_OUT:
834                             case PGAIO_HS_STAGED:
835                             case PGAIO_HS_COMPLETED_LOCAL:
836                                     elog(ERROR, "shouldn't get here with io:%d in state %d",
837                                              pgaio_io_get_id(ioh), ioh->state);
838                                     break;
839     
840                             case PGAIO_HS_COMPLETED_IO:
841                             case PGAIO_HS_SUBMITTED:
842                                     pgaio_debug_io(DEBUG2, ioh,
843                                                                "waiting for free io with %u in flight",
844                                                                dclist_count(&pgaio_my_backend->in_flight_ios));
845     
846                                     /*
847                                      * In a more general case this would be racy, because the
848                                      * generation could increase after we read ioh->state above.
849                                      * But we are only looking at IOs by the current backend and
850                                      * the IO can only be recycled by this backend.  Even this is
851                                      * only OK because we get the handle's generation before
852                                      * potentially processing interrupts, e.g. as part of
853                                      * pgaio_debug_io().
854                                      */
855                                     pgaio_io_wait(ioh, generation);
856                                     break;
857     
858                             case PGAIO_HS_COMPLETED_SHARED:
859     
860                                     /*
861                                      * It's possible that another backend just finished this IO.
862                                      *
863                                      * Note that no interrupts are processed between the state
864                                      * check and the call to reclaim - that's important as
865                                      * otherwise an interrupt could have already reclaimed the
866                                      * handle.
867                                      *
868                                      * Need to ensure that there's no reordering, in the more
869                                      * common paths, where we wait for IO, that's done by
870                                      * pgaio_io_was_recycled().
871                                      */
872                                     pg_read_barrier();
873                                     pgaio_io_reclaim(ioh);
874                                     break;
875                     }
876     
877                     if (dclist_count(&pgaio_my_backend->idle_ios) == 0)
878                             elog(PANIC, "no idle IO after waiting for IO to terminate");
879                     return;
880             }
881     }
882     
883     /*
884      * Internal - code outside of AIO should never need this and it'd be hard for
885      * such code to be safe.
886      */
887     static PgAioHandle *
888     pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation)
889     {
890             PgAioHandle *ioh;
891     
892             Assert(iow->aio_index < pgaio_ctl->io_handle_count);
893     
894             ioh = &pgaio_ctl->io_handles[iow->aio_index];
895     
896             *ref_generation = ((uint64) iow->generation_upper) << 32 |
897                     iow->generation_lower;
898     
899             Assert(*ref_generation != 0);
900     
901             return ioh;
902     }
903     
904     static const char *
905     pgaio_io_state_get_name(PgAioHandleState s)
906     {
907     #define PGAIO_HS_TOSTR_CASE(sym) case PGAIO_HS_##sym: return #sym
908             switch ((PgAioHandleState) s)
   0x00000000004364d4 <+158>:   bltu    a2,a1,0x436500 <pgaio_io_update_state+202>
   0x00000000004364d8 <+162>:   slli    a1,a1,0x3
   0x00000000004364da <+164>:   auipc   a0,0x485
   0x00000000004364de <+168>:   addi    a0,a0,-530 # 0x8bb2c8
   0x00000000004364e2 <+172>:   add     a0,a0,a1
   0x00000000004364e4 <+174>:   ld      a4,0(a0)
   0x00000000004364ea <+180>:   bltu    a2,s4,0x43650a <pgaio_io_update_state+212>
   0x00000000004364ee <+184>:   slli    a0,s4,0x3
   0x00000000004364f2 <+188>:   auipc   a2,0x485
   0x00000000004364f6 <+192>:   addi    a2,a2,-554 # 0x8bb2c8
   0x00000000004364fa <+196>:   add     a0,a0,a2
   0x00000000004364fc <+198>:   ld      a5,0(a0)
   0x00000000004364fe <+200>:   j       0x43650c <pgaio_io_update_state+214>
   0x0000000000436500 <+202>:   li      a4,0
   0x0000000000436506 <+208>:   bgeu    a2,s4,0x4364ee <pgaio_io_update_state+184>
   0x000000000043650a <+212>:   li      a5,0

909             {
910                             PGAIO_HS_TOSTR_CASE(IDLE);
911                             PGAIO_HS_TOSTR_CASE(HANDED_OUT);
912                             PGAIO_HS_TOSTR_CASE(DEFINED);
913                             PGAIO_HS_TOSTR_CASE(STAGED);
914                             PGAIO_HS_TOSTR_CASE(SUBMITTED);
915                             PGAIO_HS_TOSTR_CASE(COMPLETED_IO);
916                             PGAIO_HS_TOSTR_CASE(COMPLETED_SHARED);
917                             PGAIO_HS_TOSTR_CASE(COMPLETED_LOCAL);
918             }
919     #undef PGAIO_HS_TOSTR_CASE
920     
921             return NULL;                            /* silence compiler */
922     }
923     
924     const char *
925     pgaio_io_get_state_name(PgAioHandle *ioh)
926     {
927             return pgaio_io_state_get_name(ioh->state);
   0x00000000004364cc <+150>:   lbu     a1,0(s1)
   0x00000000004364d0 <+154>:   li      a2,7

End of assembler dump.
(gdb) 
GNU gdb (Debian 15.2-1) 15.2
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "riscv64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tmp_install/usr/local/pgsql/bin/postgres...
(gdb) Dump of assembler code for function pgaio_io_was_recycled:
558             *state = ioh->state;
   0x0000000000436f98 <+0>:     lbu     a3,0(a0)
   0x0000000000436f9c <+4>:     sw      a3,0(a2)

559     
560             /*
561              * Ensure that we don't see an earlier state of the handle than ioh->state
562              * due to compiler or CPU reordering. This protects both ->generation as
563              * directly used here, and other fields in the handle accessed in the
564              * caller if the handle was not reused.
565              */
566             pg_read_barrier();
   0x0000000000436f9e <+6>:     fence   r,rw

567     
568             return ioh->generation != ref_generation;
   0x0000000000436fa2 <+10>:    ld      a0,64(a0)
   0x0000000000436fa4 <+12>:    xor     a0,a0,a1
   0x0000000000436fa6 <+14>:    snez    a0,a0
   0x0000000000436faa <+18>:    ret

End of assembler dump.
(gdb) 
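
For anyone reading the dump without the source at hand, the load / fence / load sequence shown for pgaio_io_was_recycled() can be approximated in portable C11 as below. This is only a rough sketch, not the PostgreSQL code: DemoHandle, DemoWaitRef, demo_ref_generation() and demo_was_recycled() are hypothetical names, only the two fields involved are modelled, and atomic_thread_fence(memory_order_acquire) is an assumed stand-in for pg_read_barrier().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PgAioHandle: just the two fields being ordered. */
typedef struct DemoHandle
{
	_Atomic uint8_t state;		/* read first (the "lbu" in the dump) */
	_Atomic uint64_t generation;	/* read second (the "ld" in the dump) */
} DemoHandle;

/* Hypothetical stand-in for a wait reference with a split 64-bit generation. */
typedef struct DemoWaitRef
{
	uint32_t	generation_upper;
	uint32_t	generation_lower;
} DemoWaitRef;

/* Reassemble the 64-bit generation, as pgaio_io_from_wref() does above. */
static uint64_t
demo_ref_generation(const DemoWaitRef *ref)
{
	uint64_t	gen = ((uint64_t) ref->generation_upper) << 32 |
		ref->generation_lower;

	assert(gen != 0);			/* generation 0 is treated as invalid */
	return gen;
}

/*
 * Read the state, then fence, then read the generation.  The acquire fence
 * keeps the generation load from being reordered before the state load;
 * it is an assumption that this matches what pg_read_barrier() provides.
 */
static bool
demo_was_recycled(DemoHandle *h, uint64_t ref_generation, uint8_t *state)
{
	*state = atomic_load_explicit(&h->state, memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&h->generation,
								memory_order_relaxed) != ref_generation;
}
```

A caller would fetch the reference generation once with demo_ref_generation() and only trust the returned state when demo_was_recycled() reports false. Whether a given compiler lowers the fence to the same "fence r,rw" instruction seen above is worth checking in the generated assembly rather than assuming.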
