Hello Thomas and Andres,

04.06.2025 23:32, Thomas Munro wrote:
On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <and...@anarazel.de> wrote:
On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote:
2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh:
0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks:
2, ioh->generation: 21694
But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any
"normal" reason for that. We know that the IO wasn't actually started as
otherwise pgaio_io_start_readv() would have logged that fact.
A cheap/easy thing to try, maybe: give that enumeration more
distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get
a wild 1 here?

Thank you for your attention to this and for the tip! Today I tried the
following:
--- a/src/include/storage/aio.h
+++ b/src/include/storage/aio.h
@@ -89,8 +89,8 @@ typedef enum PgAioOp
        /* intentionally the zero value, to help catch zeroed memory etc */
        PGAIO_OP_INVALID = 0,

-       PGAIO_OP_READV,
-       PGAIO_OP_WRITEV,
+       PGAIO_OP_READV = 0xaa,
+       PGAIO_OP_WRITEV = 0xbb,

And got::
!!!AsyncReadBuffers [68292] (1)| blocknum: 9, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13038
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, 
ioh->generation: 13038
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 47468
!!!pgaio_io_reclaim [67729]| ioh: 0x102b25540, ioh->op: 0, ioh->generation: 
47469
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13038
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 
13039
!!!AsyncReadBuffers [68292] (1)| blocknum: 10, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13039
!!!pgaio_io_start_readv [68292]| ioh: 0x102b26230, ioh->op: 170, 
ioh->generation: 13039
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 35821
!!!pgaio_io_reclaim [67718]| ioh: 0x102b25390, ioh->op: 0, ioh->generation: 
35822
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 170, ioh->state: 6, ioh->result: 8192, ioh->num_callbacks: 2, ioh->generation: 13039
!!!pgaio_io_reclaim [68292]| ioh: 0x102b26230, ioh->op: 0, ioh->generation: 
13040
!!!AsyncReadBuffers [68292] (1)| blocknum: 11, ioh: 0x102b26230, ioh->op: 0, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 0, ioh->generation: 13040 2025-06-05 04:15:38.847 EDT [68292:1] LOG: !!!pgaio_io_before_start| ioh: 0x102b26230, ioh->op: 170, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: 2, ioh->generation: 13040
### I reproduced the failure a few times and observed "ioh->op: 170" each time. 
###
2025-06-05 04:15:38.847 EDT [68292:2] STATEMENT:  select count(*) from simple r 
join extremely_skewed s using (id);
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 
164, PID: 68292

0   postgres                            0x000000010144b550 ExceptionalCondition 
+ 236
1   postgres                            0x0000000101152260 
pgaio_io_before_start + 516
2   postgres                            0x0000000101151fb4 pgaio_io_start_readv 
+ 36
3   postgres                            0x0000000101171144 FileStartReadV + 252
4   postgres                            0x00000001011c4424 mdstartreadv + 668
5   postgres                            0x00000001011c7a0c smgrstartreadv + 116
6   postgres                            0x000000010115a3b8 AsyncReadBuffers + 
2028
7   postgres                            0x0000000101159354 StartReadBuffersImpl 
+ 1196
8   postgres                            0x0000000101158e98 StartReadBuffers + 64
9   postgres                            0x00000001011564a0 
read_stream_start_pending_read + 1204
10  postgres                            0x0000000101155b84 
read_stream_look_ahead + 812
11  postgres                            0x000000010115571c 
read_stream_next_buffer + 2356
12  postgres                            0x0000000100b4a04c 
heap_fetch_next_buffer + 284
13  postgres                            0x0000000100b3ade8 heapgettup_pagemode 
+ 192
14  postgres                            0x0000000100b3b3b8 heap_getnextslot + 84
15  postgres                            0x0000000100ebd818 
table_scan_getnextslot + 340
16  postgres                            0x0000000100ebd624 SeqNext + 148
17  postgres                            0x0000000100ebdcf8 ExecScanFetch + 772
18  postgres                            0x0000000100ebd8cc ExecScanExtended + 
164
19  postgres                            0x0000000100ebd004 
ExecSeqScanWithProject + 244
20  postgres                            0x0000000100e970f4 ExecProcNode + 68

I also tried reproducing this with -DFDDEBUG, no luck so far, probably due
to significant change of the test duration.

Best regards,
Alexander Lakhin


Reply via email to