Hello Thomas,
12.10.2025 06:35, Thomas Munro wrote:
On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <[email protected]> wrote:
2025-10-11 11:34:46.793 GMT [1169773:1] PANIC: !!!pgaio_io_wait| ioh->state
changed from 0 to 1 at iteration 0
# no other iteration number observed
Can you please disassemble pgaio_io_update_state() and
pgaio_io_was_recycled()? I wonder if the memory barriers are not
being generated correctly, causing the state and generation to be
loaded out of order, or something like that...
Please find those attached. gdb "disass/m pgaio_io_update_state" does not
show the start of the function first (it is still disassembled further
down), so I decided to share the whole output.
This is from clean master, without any modifications, but with the issue
confirmed for this build:
2025-10-11 20:29:11.724 UTC [1679534:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.725 UTC [1679536:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.726 UTC [1679538:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679533:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679537:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.724 UTC [1679535:1] [unknown] LOG: connection received: host=[local]
2025-10-11 20:29:11.729 UTC [1679539:1] [unknown] LOG: connection received: host=[local]
...
2025-10-11 20:29:11.778 UTC [1679537:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_schema
2025-10-11 20:29:11.778 UTC [1679533:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_type
2025-10-11 20:29:11.778 UTC [1679539:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_procedure
2025-10-11 20:29:11.778 UTC [1679536:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_misc
2025-10-11 20:29:11.778 UTC [1679538:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_table
2025-10-11 20:29:11.778 UTC [1679535:3] [unknown] LOG: connection authorized: user=debian database=regression application_name=pg_regress/create_function_c
2025-10-11 20:29:11.790 UTC [1679534:2] [unknown] FATAL: IO in wrong state: 0
2025-10-11 20:29:11.790 UTC [1679534:3] [unknown] BACKTRACE:
postgres: debian regression [local] authentication(+0x4371ec) [0x555565cd61ec]
postgres: debian regression [local] authentication(+0x442bb2) [0x555565ce1bb2]
postgres: debian regression [local] authentication(StartBufferIO+0x66) [0x555565ce1a10]
postgres: debian regression [local] authentication(+0x43df60) [0x555565cdcf60]
postgres: debian regression [local] authentication(StartReadBuffer+0x308) [0x555565cdc8a4]
postgres: debian regression [local] authentication(ReadBufferExtended+0x64) [0x555565cdaace]
...
postgres: debian regression [local] authentication(postmaster_child_launch+0x132) [0x555565c7787c]
postgres: debian regression [local] authentication(+0x3dcd52) [0x555565c7bd52]
postgres: debian regression [local] authentication(InitProcessGlobals+0) [0x555565c79c96]
postgres: debian regression [local] authentication(+0x3194b2) [0x555565bb84b2]
/lib/riscv64-linux-gnu/libc.so.6(+0x2791c) [0x7fffa690891c]
/lib/riscv64-linux-gnu/libc.so.6(__libc_start_main+0x74) [0x7fffa69089c4]
postgres: debian regression [local] authentication(_start+0x20) [0x5555659842c8]
The previous failure on greenfly was a TIMEOUT in the same test, as if
a query was hanging.
Yeah, I'll try to reproduce it too...
On Sun, Oct 12, 2025 at 2:00 AM Alexander Lakhin <[email protected]> wrote:
I've managed to reproduce it using qemu-system-riscv64 with Debian trixie
Huh, that's interesting. What is the host architecture? When I saw
that error myself and wondered about memory order, I dismissed the
idea of trying with qemu, figuring that my x86 host's TSO would affect
the coherency, but thinking again about that... I guess the compiler
might still reorder during riscv codegen if there is something wrong
with the barrier support, and even if it doesn't, the binary
translation to x86 might also feel free to reorder stuff if there are
no barrier instructions to prevent it? Or maybe that doesn't happen
but your host is ARM?
My host is x86-64: an AMD Ryzen 9 7900X running Ubuntu 24.04. I run the RISC-V guest with:
qemu-system-riscv64 -machine virt -m 1G -smp 8 -cpu rv64 -device virtio-blk-device,drive=hd \
-drive file=.../dqib_riscv64-virt/image.qcow2,if=none,id=hd -device virtio-net-device,netdev=net \
-netdev user,id=net,hostfwd=tcp::22222-:22 -bios /usr/lib/riscv64-linux-gnu/opensbi/generic/fw_jump.elf \
-kernel /usr/lib/u-boot/qemu-riscv64_smode/uboot.elf -object rng-random,filename=/dev/urandom,id=rng \
-device virtio-rng-device,rng=rng -nographic -append "root=LABEL=rootfs console=ttyS0"
I don't know what hardware greenfly uses (CC'ing Greg in case he'd like to
share some info on this), but timings are similar to what I'm seeing:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=greenfly&dt=2025-10-09%2000%3A06%3A03&stg=check
ok 160 - select_parallel 5019 ms
`make check` on mine, with select_parallel repeated:
ok 160 - select_parallel 6042 ms
ok 161 - select_parallel 6089 ms
ok 162 - select_parallel 6116 ms
copperhead and boomslang, which run on real RISC-V hardware [1], show
better timings:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=copperhead&dt=2025-10-11%2019%3A36%3A33&stg=check
ok 160 - select_parallel 2677 ms
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=boomslang&dt=2025-10-12%2002%3A17%3A53&stg=check
ok 160 - select_parallel 3651 ms
Thank you for looking into this!
[1] https://www.postgresql.org/message-id/3db97903-884f-4b0c-b1cd-d7442e71ea75%40app.fastmail.com
Best regards,
Alexander
GNU gdb (Debian 15.2-1) 15.2
Copyright (C) 2024 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "riscv64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from tmp_install/usr/local/pgsql/bin/postgres...
(gdb) Dump of assembler code for function pgaio_io_update_state:
341 Assert(ioh >= pgaio_ctl->io_handles &&
0x0000000000436490 <+90>: auipc a0,0x4c3
0x0000000000436494 <+94>: ld a1,-616(a0) # 0x8f9228 <pgaio_ctl>
0x0000000000436498 <+98>: ld a0,48(a1)
0x000000000043649a <+100>: bltu s1,a0,0x436552 <pgaio_io_update_state+284>
0x000000000043649e <+104>: lwu a1,40(a1)
0x00000000004364a2 <+108>: li a2,144
0x00000000004364a6 <+112>: mul a1,a1,a2
0x00000000004364aa <+116>: add a1,a1,a0
0x00000000004364ac <+118>: bgeu s1,a1,0x436552 <pgaio_io_update_state+284>
0x00000000004364b0 <+122>: auipc a1,0x4c1
0x00000000004364b4 <+126>: ld s3,280(a1) # 0x8f75c8
0x0000000000436552 <+284>: auipc a0,0x242
0x0000000000436556 <+288>: addi a0,a0,549 # 0x678777
0x000000000043655a <+292>: auipc a1,0x242
0x000000000043655e <+296>: addi a1,a1,163 # 0x6785fd
0x0000000000436562 <+300>: li a2,342
0x0000000000436566 <+304>: auipc ra,0x160
0x000000000043656a <+308>: jalr 994(ra) # 0x596948 <ExceptionalCondition>
342 ioh < (pgaio_ctl->io_handles +
pgaio_ctl->io_handle_count));
343 return ioh - pgaio_ctl->io_handles;
0x00000000004364b8 <+130>: sub s0,s1,a0
0x00000000004364bc <+134>: srai s0,s0,0x4
344 }
345
346 /*
347 * Return the ProcNumber for the process that can use an IO handle. The
348 * mapping from IO handles to PGPROCs is static, therefore this even
works
349 * when the corresponding PGPROC is not in use.
350 */
351 ProcNumber
352 pgaio_io_get_owner(PgAioHandle *ioh)
353 {
354 return ioh->owner_procno;
355 }
356
357 /*
358 * Return a wait reference for the IO. Only wait references can be used
to
359 * wait for an IOs completion, as handles themselves can be reused after
360 * completion. See also the comment above pgaio_io_acquire().
361 */
362 void
363 pgaio_io_get_wref(PgAioHandle *ioh, PgAioWaitRef *iow)
364 {
365 Assert(ioh->state == PGAIO_HS_HANDED_OUT ||
366 ioh->state == PGAIO_HS_DEFINED ||
367 ioh->state == PGAIO_HS_STAGED);
368 Assert(ioh->generation != 0);
369
370 iow->aio_index = ioh - pgaio_ctl->io_handles;
371 iow->generation_upper = (uint32) (ioh->generation >> 32);
372 iow->generation_lower = (uint32) ioh->generation;
373 }
374
375
376
377 /*
--------------------------------------------------------------------------------
378 * Internal Functions related to PgAioHandle
379 *
--------------------------------------------------------------------------------
380 */
381
382 static inline void
383 pgaio_io_update_state(PgAioHandle *ioh, PgAioHandleState new_state)
384 {
0x0000000000436436 <+0>: addi sp,sp,-48
0x0000000000436438 <+2>: sd ra,40(sp)
0x000000000043643a <+4>: sd s0,32(sp)
0x000000000043643c <+6>: sd s1,24(sp)
0x000000000043643e <+8>: sd s2,16(sp)
0x0000000000436440 <+10>: sd s3,8(sp)
0x0000000000436442 <+12>: sd s4,0(sp)
385 /*
386 * All callers need to have held interrupts in some form, otherwise
387 * interrupt processing could wait for the IO to complete, while in an
388 * intermediary state.
389 */
390 Assert(!INTERRUPTS_CAN_BE_PROCESSED());
0x0000000000436444 <+14>: auipc a2,0x4a6
0x0000000000436448 <+18>: ld a2,564(a2) # 0x8dc678
0x000000000043644c <+22>: lw a2,0(a2)
0x000000000043644e <+24>: mv s4,a1
0x0000000000436450 <+26>: mv s1,a0
0x0000000000436452 <+28>: bnez a2,0x43646e <pgaio_io_update_state+56>
0x0000000000436454 <+30>: auipc a0,0x4a6
0x0000000000436458 <+34>: ld a0,508(a0) # 0x8dc650
0x000000000043645c <+38>: lw a0,0(a0)
0x000000000043645e <+40>: bnez a0,0x43646e <pgaio_io_update_state+56>
0x0000000000436460 <+42>: auipc a0,0x4a6
0x0000000000436464 <+46>: ld a0,-1616(a0) # 0x8dbe10
0x0000000000436468 <+50>: lw a0,0(a0)
0x000000000043646a <+52>: beqz a0,0x43656e <pgaio_io_update_state+312>
0x000000000043656e <+312>: auipc a0,0x242
0x0000000000436572 <+316>: addi a0,a0,1670 # 0x678bf4
0x0000000000436576 <+320>: auipc a1,0x242
0x000000000043657a <+324>: addi a1,a1,135 # 0x6785fd
0x000000000043657e <+328>: li a2,390
0x0000000000436582 <+332>: auipc ra,0x160
0x0000000000436586 <+336>: jalr 966(ra) # 0x596948 <ExceptionalCondition>
391
392 pgaio_debug_io(DEBUG5, ioh,
0x000000000043646e <+56>: li a0,10
0x0000000000436470 <+58>: li a1,0
0x0000000000436472 <+60>: auipc ra,0x161
0x0000000000436476 <+64>: jalr -1200(ra) # 0x596fc2 <errstart>
0x000000000043647a <+68>: beqz a0,0x43653a <pgaio_io_update_state+260>
0x000000000043647c <+70>: li a0,1
0x000000000043647e <+72>: auipc ra,0x163
0x0000000000436482 <+76>: jalr 1264(ra) # 0x59996e <errhidestmt>
0x0000000000436486 <+80>: li a0,1
0x0000000000436488 <+82>: auipc ra,0x163
0x000000000043648c <+86>: jalr 1356(ra) # 0x5999d4 <errhidecontext>
0x00000000004364be <+136>: mv a0,s1
0x00000000004364c0 <+138>: jal 0x438f90 <pgaio_io_get_op_name>
0x00000000004364c4 <+142>: mv s2,a0
0x00000000004364c6 <+144>: mv a0,s1
0x00000000004364c8 <+146>: jal 0x439076 <pgaio_io_get_target_name>
0x00000000004364d2 <+156>: mv a3,a0
0x00000000004364e6 <+176>: mulw a1,s0,s3
0x0000000000436502 <+204>: mulw a1,s0,s3
0x000000000043650c <+214>: auipc a0,0x242
0x0000000000436510 <+218>: addi a0,a0,1799 # 0x678c13
0x0000000000436514 <+222>: mv a2,s2
0x0000000000436516 <+224>: auipc ra,0x161
0x000000000043651a <+228>: jalr -52(ra) # 0x5974e2 <errmsg_internal>
0x000000000043651e <+232>: auipc a0,0x242
0x0000000000436522 <+236>: addi a0,a0,223 # 0x6785fd
0x0000000000436526 <+240>: auipc a1,0x242
0x000000000043652a <+244>: addi a2,a1,1836 # 0x678c52
0x000000000043652e <+248>: li a1,394
0x0000000000436532 <+252>: auipc ra,0x161
0x0000000000436536 <+256>: jalr -694(ra) # 0x59727c <errfinish>
393 "updating state to %s",
394 pgaio_io_state_get_name(new_state));
395
396 /*
397 * Ensure the changes signified by the new state are visible before the
398 * new state becomes visible.
399 */
400 pg_write_barrier();
0x000000000043653a <+260>: fence rw,w
401
402 ioh->state = new_state;
0x000000000043653e <+264>: sb s4,0(s1)
0x0000000000436542 <+268>: ld ra,40(sp)
0x0000000000436544 <+270>: ld s0,32(sp)
0x0000000000436546 <+272>: ld s1,24(sp)
0x0000000000436548 <+274>: ld s2,16(sp)
0x000000000043654a <+276>: ld s3,8(sp)
0x000000000043654c <+278>: ld s4,0(sp)
403 }
0x000000000043654e <+280>: addi sp,sp,48
0x0000000000436550 <+282>: ret
404
405 static void
406 pgaio_io_resowner_register(PgAioHandle *ioh)
407 {
408 Assert(!ioh->resowner);
409 Assert(CurrentResourceOwner);
410
411 ResourceOwnerRememberAioHandle(CurrentResourceOwner,
&ioh->resowner_node);
412 ioh->resowner = CurrentResourceOwner;
413 }
414
415 /*
416 * Stage IO for execution and, if appropriate, submit it immediately.
417 *
418 * Should only be called from pgaio_io_start_*().
419 */
420 void
421 pgaio_io_stage(PgAioHandle *ioh, PgAioOp op)
422 {
423 bool needs_synchronous;
424
425 Assert(ioh->state == PGAIO_HS_HANDED_OUT);
426 Assert(pgaio_my_backend->handed_out_io == ioh);
427 Assert(pgaio_io_has_target(ioh));
428
429 /*
430 * Otherwise an interrupt, in the middle of staging and
possibly executing
431 * the IO, could end up trying to wait for the IO, leading to
state
432 * confusion.
433 */
434 HOLD_INTERRUPTS();
435
436 ioh->op = op;
437 ioh->result = 0;
438
439 pgaio_io_update_state(ioh, PGAIO_HS_DEFINED);
440
441 /* allow a new IO to be staged */
442 pgaio_my_backend->handed_out_io = NULL;
443
444 pgaio_io_call_stage(ioh);
445
446 pgaio_io_update_state(ioh, PGAIO_HS_STAGED);
447
448 /*
449 * Synchronous execution has to be executed, well,
synchronously, so check
450 * that first.
451 */
452 needs_synchronous = pgaio_io_needs_synchronous_execution(ioh);
453
454 pgaio_debug_io(DEBUG3, ioh,
455 "staged (synchronous: %d, in_batch:
%d)",
456 needs_synchronous,
pgaio_my_backend->in_batchmode);
457
458 if (!needs_synchronous)
459 {
460
pgaio_my_backend->staged_ios[pgaio_my_backend->num_staged_ios++] = ioh;
461 Assert(pgaio_my_backend->num_staged_ios <=
PGAIO_SUBMIT_BATCH_SIZE);
462
463 /*
464 * Unless code explicitly opted into batching IOs,
submit the IO
465 * immediately.
466 */
467 if (!pgaio_my_backend->in_batchmode)
468 pgaio_submit_staged();
469 }
470 else
471 {
472 pgaio_io_prepare_submit(ioh);
473 pgaio_io_perform_synchronously(ioh);
474 }
475
476 RESUME_INTERRUPTS();
477 }
478
479 bool
480 pgaio_io_needs_synchronous_execution(PgAioHandle *ioh)
481 {
482 /*
483 * If the caller said to execute the IO synchronously, do so.
484 *
485 * XXX: We could optimize the logic when to execute
synchronously by first
486 * checking if there are other IOs in flight and only
synchronously
487 * executing if not. Unclear whether that'll be sufficiently
common to be
488 * worth worrying about.
489 */
490 if (ioh->flags & PGAIO_HF_SYNCHRONOUS)
491 return true;
492
493 /* Check if the IO method requires synchronous execution of IO
*/
494 if (pgaio_method_ops->needs_synchronous_execution)
495 return
pgaio_method_ops->needs_synchronous_execution(ioh);
496
497 return false;
498 }
499
500 /*
501 * Handle IO being processed by IO method.
502 *
503 * Should be called by IO methods / synchronous IO execution, just
before the
504 * IO is performed.
505 */
506 void
507 pgaio_io_prepare_submit(PgAioHandle *ioh)
508 {
509 pgaio_io_update_state(ioh, PGAIO_HS_SUBMITTED);
510
511 dclist_push_tail(&pgaio_my_backend->in_flight_ios, &ioh->node);
512 }
513
514 /*
515 * Handle IO getting completed by a method.
516 *
517 * Should be called by IO methods / synchronous IO execution, just
after the
518 * IO has been performed.
519 *
520 * Expects to be called in a critical section. We expect IOs to be
usable for
521 * WAL etc, which requires being able to execute completion callbacks
in a
522 * critical section.
523 */
524 void
525 pgaio_io_process_completion(PgAioHandle *ioh, int result)
526 {
527 Assert(ioh->state == PGAIO_HS_SUBMITTED);
528
529 Assert(CritSectionCount > 0);
530
531 ioh->result = result;
532
533 pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_IO);
534
535 INJECTION_POINT("aio-process-completion-before-shared", ioh);
536
537 pgaio_io_call_complete_shared(ioh);
538
539 pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_SHARED);
540
541 /* condition variable broadcast ensures state is visible before
wakeup */
542 ConditionVariableBroadcast(&ioh->cv);
543
544 /* contains call to pgaio_io_call_complete_local() */
545 if (ioh->owner_procno == MyProcNumber)
546 pgaio_io_reclaim(ioh);
547 }
548
549 /*
550 * Has the IO completed and thus the IO handle been reused?
551 *
552 * This is useful when waiting for IO completion at a low level (e.g.
in an IO
553 * method's ->wait_one() callback).
554 */
555 bool
556 pgaio_io_was_recycled(PgAioHandle *ioh, uint64 ref_generation, PgAioHandleState *state)
557 {
558 *state = ioh->state;
559
560 /*
561 * Ensure that we don't see an earlier state of the handle than ioh->state
562 * due to compiler or CPU reordering. This protects both ->generation as
563 * directly used here, and other fields in the handle accessed in the
564 * caller if the handle was not reused.
565 */
566 pg_read_barrier();
567
568 return ioh->generation != ref_generation;
569 }
570
571 /*
572 * Wait for IO to complete. External code should never use this, outside of
573 * the AIO subsystem waits are only allowed via pgaio_wref_wait().
574 */
575 static void
576 pgaio_io_wait(PgAioHandle *ioh, uint64 ref_generation)
577 {
578 PgAioHandleState state;
579 bool am_owner;
580
581 am_owner = ioh->owner_procno == MyProcNumber;
582
583 if (pgaio_io_was_recycled(ioh, ref_generation, &state))
584 return;
585
586 if (am_owner)
587 {
588 if (state != PGAIO_HS_SUBMITTED
589 && state != PGAIO_HS_COMPLETED_IO
590 && state != PGAIO_HS_COMPLETED_SHARED
591 && state != PGAIO_HS_COMPLETED_LOCAL)
592 {
593 elog(PANIC, "waiting for own IO %d in wrong state: %s",
594 pgaio_io_get_id(ioh), pgaio_io_get_state_name(ioh));
595 }
596 }
597
598 while (true)
599 {
600 if (pgaio_io_was_recycled(ioh, ref_generation, &state))
601 return;
602
603 switch ((PgAioHandleState) state)
604 {
605 case PGAIO_HS_IDLE:
606 case PGAIO_HS_HANDED_OUT:
607 elog(ERROR, "IO in wrong state: %d", state);
608 break;
609
610 case PGAIO_HS_SUBMITTED:
611
612 /*
613 * If we need to wait via the IO
method, do so now. Don't
614 * check via the IO method if the
issuing backend is executing
615 * the IO synchronously.
616 */
617 if (pgaio_method_ops->wait_one &&
!(ioh->flags & PGAIO_HF_SYNCHRONOUS))
618 {
619 pgaio_method_ops->wait_one(ioh,
ref_generation);
620 continue;
621 }
622 /* fallthrough */
623
624 /* waiting for owner to submit */
625 case PGAIO_HS_DEFINED:
626 case PGAIO_HS_STAGED:
627 /* waiting for reaper to complete */
628 /* fallthrough */
629 case PGAIO_HS_COMPLETED_IO:
630 /* shouldn't be able to hit this
otherwise */
631 Assert(IsUnderPostmaster);
632 /* ensure we're going to get woken up */
633
ConditionVariablePrepareToSleep(&ioh->cv);
634
635 while (!pgaio_io_was_recycled(ioh,
ref_generation, &state))
636 {
637 if (state ==
PGAIO_HS_COMPLETED_SHARED ||
638 state ==
PGAIO_HS_COMPLETED_LOCAL)
639 break;
640
ConditionVariableSleep(&ioh->cv, WAIT_EVENT_AIO_IO_COMPLETION);
641 }
642
643 ConditionVariableCancelSleep();
644 break;
645
646 case PGAIO_HS_COMPLETED_SHARED:
647 case PGAIO_HS_COMPLETED_LOCAL:
648
649 /*
650 * Note that no interrupts are
processed between
651 * pgaio_io_was_recycled() and this
check - that's important
652 * as otherwise an interrupt could have
already reclaimed the
653 * handle.
654 */
655 if (am_owner)
656 pgaio_io_reclaim(ioh);
657 return;
658 }
659 }
660 }
661
662 /*
663 * Make IO handle ready to be reused after IO has completed or after the
664 * handle has been released without being used.
665 *
666 * Note that callers need to be careful about only calling this in the
right
667 * state and that no interrupts can be processed between the state
check and
668 * the call to pgaio_io_reclaim(). Otherwise interrupt processing could
669 * already have reclaimed the handle.
670 */
671 static void
672 pgaio_io_reclaim(PgAioHandle *ioh)
673 {
674 /* This is only ok if it's our IO */
675 Assert(ioh->owner_procno == MyProcNumber);
676 Assert(ioh->state != PGAIO_HS_IDLE);
677
678 /* see comment in function header */
679 HOLD_INTERRUPTS();
680
681 /*
682 * It's a bit ugly, but right now the easiest place to put the
execution
683 * of local completion callbacks is this function, as we need
to execute
684 * local callbacks just before reclaiming at multiple callsites.
685 */
686 if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
687 {
688 PgAioResult local_result;
689
690 local_result = pgaio_io_call_complete_local(ioh);
691 pgaio_io_update_state(ioh, PGAIO_HS_COMPLETED_LOCAL);
692
693 if (ioh->report_return)
694 {
695 ioh->report_return->result = local_result;
696 ioh->report_return->target_data =
ioh->target_data;
697 }
698 }
699
700 pgaio_debug_io(DEBUG4, ioh,
701 "reclaiming: distilled_result:
(status %s, id %u, error_data %d), raw_result: %d",
702
pgaio_result_status_string(ioh->distilled_result.status),
703 ioh->distilled_result.id,
704 ioh->distilled_result.error_data,
705 ioh->result);
706
707 /* if the IO has been defined, it's on the in-flight list,
remove */
708 if (ioh->state != PGAIO_HS_HANDED_OUT)
709 dclist_delete_from(&pgaio_my_backend->in_flight_ios,
&ioh->node);
710
711 if (ioh->resowner)
712 {
713 ResourceOwnerForgetAioHandle(ioh->resowner,
&ioh->resowner_node);
714 ioh->resowner = NULL;
715 }
716
717 Assert(!ioh->resowner);
718
719 /*
720 * Update generation & state first, before resetting the IO's
fields,
721 * otherwise a concurrent "viewer" could think the fields are
valid, even
722 * though they are being reset. Increment the generation
first, so that
723 * we can assert elsewhere that we never wait for an IDLE IO.
While it's
724 * a bit weird for the state to go backwards for a generation,
it's OK
725 * here, as there cannot be references to the "reborn" IO yet.
Can't
726 * update both at once, so something has to give.
727 */
728 ioh->generation++;
729 pgaio_io_update_state(ioh, PGAIO_HS_IDLE);
730
731 /* ensure the state update is visible before we reset fields */
732 pg_write_barrier();
733
734 ioh->op = PGAIO_OP_INVALID;
735 ioh->target = PGAIO_TID_INVALID;
736 ioh->flags = 0;
737 ioh->num_callbacks = 0;
738 ioh->handle_data_len = 0;
739 ioh->report_return = NULL;
740 ioh->result = 0;
741 ioh->distilled_result.status = PGAIO_RS_UNKNOWN;
742
743 /*
744 * We push the IO to the head of the idle IO list, that seems
more cache
745 * efficient in cases where only a few IOs are used.
746 */
747 dclist_push_head(&pgaio_my_backend->idle_ios, &ioh->node);
748
749 RESUME_INTERRUPTS();
750 }
751
752 /*
753 * Wait for an IO handle to become usable.
754 *
755 * This only really is useful for pgaio_io_acquire().
756 */
757 static void
758 pgaio_io_wait_for_free(void)
759 {
760 int reclaimed = 0;
761
762 pgaio_debug(DEBUG2, "waiting for free IO with %d pending, %u
in-flight, %u idle IOs",
763 pgaio_my_backend->num_staged_ios,
764
dclist_count(&pgaio_my_backend->in_flight_ios),
765
dclist_count(&pgaio_my_backend->idle_ios));
766
767 /*
768 * First check if any of our IOs actually have completed - when
using
769 * worker, that'll often be the case. We could do so as part of
the loop
770 * below, but that'd potentially lead us to wait for some IO
submitted
771 * before.
772 */
773 for (int i = 0; i < io_max_concurrency; i++)
774 {
775 PgAioHandle *ioh =
&pgaio_ctl->io_handles[pgaio_my_backend->io_handle_off + i];
776
777 if (ioh->state == PGAIO_HS_COMPLETED_SHARED)
778 {
779 /*
780 * Note that no interrupts are processed
between the state check
781 * and the call to reclaim - that's important
as otherwise an
782 * interrupt could have already reclaimed the
handle.
783 *
784 * Need to ensure that there's no reordering,
in the more common
785 * paths, where we wait for IO, that's done by
786 * pgaio_io_was_recycled().
787 */
788 pg_read_barrier();
789 pgaio_io_reclaim(ioh);
790 reclaimed++;
791 }
792 }
793
794 if (reclaimed > 0)
795 return;
796
797 /*
798 * If we have any unsubmitted IOs, submit them now. We'll start
waiting in
799 * a second, so it's better they're in flight. This also
addresses the
800 * edge-case that all IOs are unsubmitted.
801 */
802 if (pgaio_my_backend->num_staged_ios > 0)
803 pgaio_submit_staged();
804
805 /* possibly some IOs finished during submission */
806 if (!dclist_is_empty(&pgaio_my_backend->idle_ios))
807 return;
808
809 if (dclist_count(&pgaio_my_backend->in_flight_ios) == 0)
810 ereport(ERROR,
811 errmsg_internal("no free IOs despite no
in-flight IOs"),
812 errdetail_internal("%d pending, %u
in-flight, %u idle IOs",
813
pgaio_my_backend->num_staged_ios,
814
dclist_count(&pgaio_my_backend->in_flight_ios),
815
dclist_count(&pgaio_my_backend->idle_ios)));
816
817 /*
818 * Wait for the oldest in-flight IO to complete.
819 *
820 * XXX: Reusing the general IO wait is suboptimal, we don't
need to wait
821 * for that specific IO to complete, we just need *any* IO to
complete.
822 */
823 {
824 PgAioHandle *ioh = dclist_head_element(PgAioHandle,
node,
825
&pgaio_my_backend->in_flight_ios);
826 uint64 generation = ioh->generation;
827
828 switch ((PgAioHandleState) ioh->state)
829 {
830 /* should not be in in-flight list */
831 case PGAIO_HS_IDLE:
832 case PGAIO_HS_DEFINED:
833 case PGAIO_HS_HANDED_OUT:
834 case PGAIO_HS_STAGED:
835 case PGAIO_HS_COMPLETED_LOCAL:
836 elog(ERROR, "shouldn't get here with
io:%d in state %d",
837 pgaio_io_get_id(ioh),
ioh->state);
838 break;
839
840 case PGAIO_HS_COMPLETED_IO:
841 case PGAIO_HS_SUBMITTED:
842 pgaio_debug_io(DEBUG2, ioh,
843 "waiting for
free io with %u in flight",
844
dclist_count(&pgaio_my_backend->in_flight_ios));
845
846 /*
847 * In a more general case this would be
racy, because the
848 * generation could increase after we
read ioh->state above.
849 * But we are only looking at IOs by
the current backend and
850 * the IO can only be recycled by this
backend. Even this is
851 * only OK because we get the handle's
generation before
852 * potentially processing interrupts,
e.g. as part of
853 * pgaio_debug_io().
854 */
855 pgaio_io_wait(ioh, generation);
856 break;
857
858 case PGAIO_HS_COMPLETED_SHARED:
859
860 /*
861 * It's possible that another backend
just finished this IO.
862 *
863 * Note that no interrupts are
processed between the state
864 * check and the call to reclaim -
that's important as
865 * otherwise an interrupt could have
already reclaimed the
866 * handle.
867 *
868 * Need to ensure that there's no
reordering, in the more
869 * common paths, where we wait for IO,
that's done by
870 * pgaio_io_was_recycled().
871 */
872 pg_read_barrier();
873 pgaio_io_reclaim(ioh);
874 break;
875 }
876
877 if (dclist_count(&pgaio_my_backend->idle_ios) == 0)
878 elog(PANIC, "no idle IO after waiting for IO to
terminate");
879 return;
880 }
881 }
882
883 /*
884 * Internal - code outside of AIO should never need this and it'd be
hard for
885 * such code to be safe.
886 */
887 static PgAioHandle *
888 pgaio_io_from_wref(PgAioWaitRef *iow, uint64 *ref_generation)
889 {
890 PgAioHandle *ioh;
891
892 Assert(iow->aio_index < pgaio_ctl->io_handle_count);
893
894 ioh = &pgaio_ctl->io_handles[iow->aio_index];
895
896 *ref_generation = ((uint64) iow->generation_upper) << 32 |
897 iow->generation_lower;
898
899 Assert(*ref_generation != 0);
900
901 return ioh;
902 }
903
904 static const char *
905 pgaio_io_state_get_name(PgAioHandleState s)
906 {
907 #define PGAIO_HS_TOSTR_CASE(sym) case PGAIO_HS_##sym: return #sym
908 switch ((PgAioHandleState) s)
0x00000000004364d4 <+158>: bltu a2,a1,0x436500 <pgaio_io_update_state+202>
0x00000000004364d8 <+162>: slli a1,a1,0x3
0x00000000004364da <+164>: auipc a0,0x485
0x00000000004364de <+168>: addi a0,a0,-530 # 0x8bb2c8
0x00000000004364e2 <+172>: add a0,a0,a1
0x00000000004364e4 <+174>: ld a4,0(a0)
0x00000000004364ea <+180>: bltu a2,s4,0x43650a <pgaio_io_update_state+212>
0x00000000004364ee <+184>: slli a0,s4,0x3
0x00000000004364f2 <+188>: auipc a2,0x485
0x00000000004364f6 <+192>: addi a2,a2,-554 # 0x8bb2c8
0x00000000004364fa <+196>: add a0,a0,a2
0x00000000004364fc <+198>: ld a5,0(a0)
0x00000000004364fe <+200>: j 0x43650c <pgaio_io_update_state+214>
0x0000000000436500 <+202>: li a4,0
0x0000000000436506 <+208>: bgeu a2,s4,0x4364ee <pgaio_io_update_state+184>
0x000000000043650a <+212>: li a5,0
909 {
910 PGAIO_HS_TOSTR_CASE(IDLE);
911 PGAIO_HS_TOSTR_CASE(HANDED_OUT);
912 PGAIO_HS_TOSTR_CASE(DEFINED);
913 PGAIO_HS_TOSTR_CASE(STAGED);
914 PGAIO_HS_TOSTR_CASE(SUBMITTED);
915 PGAIO_HS_TOSTR_CASE(COMPLETED_IO);
916 PGAIO_HS_TOSTR_CASE(COMPLETED_SHARED);
917 PGAIO_HS_TOSTR_CASE(COMPLETED_LOCAL);
918 }
919 #undef PGAIO_HS_TOSTR_CASE
920
921 return NULL; /* silence compiler */
922 }
923
924 const char *
925 pgaio_io_get_state_name(PgAioHandle *ioh)
926 {
927 return pgaio_io_state_get_name(ioh->state);
0x00000000004364cc <+150>: lbu a1,0(s1)
0x00000000004364d0 <+154>: li a2,7
End of assembler dump.
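
As a side note on the source interleaved in the dump above: pgaio_io_get_wref() stores the 64-bit handle generation in the wait reference as two 32-bit halves, and pgaio_io_from_wref() reassembles it into the ref_generation that is later compared against ioh->generation. A minimal standalone sketch of that round trip follows. DemoWaitRef, demo_pack_generation(), demo_unpack_generation() and demo_generation_roundtrips() are hypothetical stand-ins for illustration, not PostgreSQL's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PgAioWaitRef; only the fields seen in the dump. */
typedef struct DemoWaitRef
{
    uint32_t aio_index;
    uint32_t generation_upper;
    uint32_t generation_lower;
} DemoWaitRef;

/* Split a 64-bit generation into two halves, as pgaio_io_get_wref() does. */
static void
demo_pack_generation(DemoWaitRef *iow, uint64_t generation)
{
    iow->generation_upper = (uint32_t) (generation >> 32);
    iow->generation_lower = (uint32_t) generation;
}

/* Reassemble the generation, as pgaio_io_from_wref() does. */
static uint64_t
demo_unpack_generation(const DemoWaitRef *iow)
{
    return ((uint64_t) iow->generation_upper) << 32 | iow->generation_lower;
}

/* Returns 1 if packing and then unpacking yields the original value. */
static int
demo_generation_roundtrips(uint64_t generation)
{
    DemoWaitRef iow = {0};

    demo_pack_generation(&iow, generation);
    return demo_unpack_generation(&iow) == generation;
}
```

The cast to uint64 before the shift matters: shifting the 32-bit upper half directly by 32 would not produce the intended value.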
(gdb) GNU gdb (Debian 15.2-1) 15.2
(gdb) Dump of assembler code for function pgaio_io_was_recycled:
558 *state = ioh->state;
0x0000000000436f98 <+0>: lbu a3,0(a0)
0x0000000000436f9c <+4>: sw a3,0(a2)
559
560 /*
561 * Ensure that we don't see an earlier state of the handle than ioh->state
562 * due to compiler or CPU reordering. This protects both ->generation as
563 * directly used here, and other fields in the handle accessed in the
564 * caller if the handle was not reused.
565 */
566 pg_read_barrier();
0x0000000000436f9e <+6>: fence r,rw
567
568 return ioh->generation != ref_generation;
0x0000000000436fa2 <+10>: ld a0,64(a0)
0x0000000000436fa4 <+12>: xor a0,a0,a1
0x0000000000436fa6 <+14>: snez a0,a0
0x0000000000436faa <+18>: ret
End of assembler dump.
(gdb)
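
Both dumps contain a fence where the source asks for one: "fence rw,w" precedes the store to ioh->state in pgaio_io_update_state(), and "fence r,rw" sits between the load of ioh->state and the load of ioh->generation in pgaio_io_was_recycled(). For reference, the pairing those two barriers are meant to provide can be sketched with C11 fences. This is only an illustration under stated assumptions: DemoHandle, demo_recycle() and demo_was_recycled() are hypothetical names, the handle is reduced to two fields, and C11 release/acquire fences stand in for pg_write_barrier()/pg_read_barrier(), which are not defined in terms of C11 atomics.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified handle: only the two fields discussed here. */
typedef struct DemoHandle
{
    _Atomic uint8_t  state;       /* 0 stands for IDLE in this sketch */
    _Atomic uint64_t generation;
} DemoHandle;

/*
 * Writer side, following the order used in pgaio_io_reclaim(): bump the
 * generation first, then publish the new state behind a release fence
 * (the role pg_write_barrier() plays in pgaio_io_update_state()).
 */
static void
demo_recycle(DemoHandle *h)
{
    uint64_t gen = atomic_load_explicit(&h->generation, memory_order_relaxed);

    atomic_store_explicit(&h->generation, gen + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&h->state, 0, memory_order_relaxed);
}

/*
 * Reader side, following pgaio_io_was_recycled(): load the state, issue an
 * acquire fence (the role of pg_read_barrier()), then load the generation.
 */
static int
demo_was_recycled(DemoHandle *h, uint64_t ref_generation, uint8_t *state)
{
    *state = atomic_load_explicit(&h->state, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&h->generation, memory_order_relaxed)
        != ref_generation;
}
```

With this ordering, a reader that observes the IDLE state written by demo_recycle() must also observe the incremented generation, so demo_was_recycled() returns true for the old reference instead of reporting a matching generation together with state 0.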