Hello Thomas,

24.05.2025 14:42, Thomas Munro wrote:
On Sat, May 24, 2025 at 3:17 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that itt should all be one backend, huh), if it's not just
a timing luck thing.  Alexander, were the other OSes you tried all on
x86?

As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied "test: brin ..." line x10. But the issue is still not
reproduced (in 8+ hours).

However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2       # +++ tap check in src/test/recovery_2 +++
3       # +++ tap check in src/test/recovery_3 +++
4       # +++ tap check in src/test/recovery_4 +++
5       # +++ tap check in src/test/recovery_5 +++
1       # +++ tap check in src/test/recovery_1 +++
7       # +++ tap check in src/test/recovery_7 +++
6       # +++ tap check in src/test/recovery_6 +++
8       # +++ tap check in src/test/recovery_8 +++
1       [16:07:31] t/027_stream_regress.pl ..
1       Dubious, test returned 29 (wstat 7424, 0x1d00)
1       All 1 subtests passed
1       [16:07:31]
1
1       Test Summary Report
1       -------------------
1       t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1         Non-zero exit status: 29
1         Parse errors: No plan found in TAP output
1       Files=1, Tests=1, 197 wallclock secs ( 0.01 usr  0.01 sys + 4.47 cusr  
5.79 csys = 10.28 CPU)
1       Result: FAIL

TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line: 
161, PID: 32355
0   postgres                            0x0000000104f078f4 ExceptionalCondition 
+ 236
1   postgres                            0x0000000104c0ebd4 
pgaio_io_before_start + 260
2   postgres                            0x0000000104c0ea94 pgaio_io_start_readv 
+ 36
3   postgres                            0x0000000104c2d4e8 FileStartReadV + 252
4   postgres                            0x0000000104c807c8 mdstartreadv + 668
5   postgres                            0x0000000104c83db0 smgrstartreadv + 116
6   postgres                            0x0000000104c1675c AsyncReadBuffers + 
1444
7   postgres                            0x0000000104c15868 StartReadBuffersImpl 
+ 1196
8   postgres                            0x0000000104c153ac StartReadBuffers + 64
9   postgres                            0x0000000104c129b4 
read_stream_start_pending_read + 1204
10  postgres                            0x0000000104c12018 
read_stream_look_ahead + 812
11  postgres                            0x0000000104c11cd0 
read_stream_next_buffer + 1396
12  postgres                            0x0000000104606c38 
heap_fetch_next_buffer + 284
13  postgres                            0x00000001045f79d4 heapgettup_pagemode 
+ 192
14  postgres                            0x00000001045f7fa4 heap_getnextslot + 84
15  postgres                            0x000000010497a404 
table_scan_getnextslot + 340
16  postgres                            0x000000010497a210 SeqNext + 148
17  postgres                            0x000000010497a8e4 ExecScanFetch + 772
18  postgres                            0x000000010497a4b8 ExecScanExtended + 
164
19  postgres                            0x0000000104979bf0 
ExecSeqScanWithProject + 244
20  postgres                            0x0000000104953ce0 ExecProcNode + 68
21  postgres                            0x000000010494f148 MultiExecPrivateHash 
+ 64
22  postgres                            0x000000010494ed48 MultiExecHash + 100
23  postgres                            0x0000000104925a18 MultiExecProcNode + 
180
24  postgres                            0x0000000104957220 ExecHashJoinImpl + 
628
25  postgres                            0x00000001049566dc ExecHashJoin + 32
26  postgres                            0x0000000104925958 ExecProcNodeFirst + 
100
27  postgres                            0x000000010491c08c ExecProcNode + 68
28  postgres                            0x000000010491692c ExecutePlan + 268
29  postgres                            0x000000010491679c standard_ExecutorRun 
+ 504
30  pg_stat_statements.dylib            0x00000001055e132c pgss_ExecutorRun + 
212
31  postgres                            0x0000000104916580 ExecutorRun + 72
32  postgres                            0x000000010491f59c ParallelQueryMain + 
508
33  postgres                            0x00000001046968d4 ParallelWorkerMain + 
1712
34  postgres                            0x0000000104b476f4 BackgroundWorkerMain 
+ 824
35  postgres                            0x0000000104b4b87c 
postmaster_child_launch + 492
36  postgres                            0x0000000104b56610 
StartBackgroundWorker + 416
37  postgres                            0x0000000104b51378 
maybe_start_bgworkers + 552
38  postgres                            0x0000000104b540cc 
LaunchMissingBackgroundProcesses + 1316
39  postgres                            0x0000000104b51650 ServerLoop + 616
40  postgres                            0x0000000104b4fef0 PostmasterMain + 6632
41  postgres                            0x00000001049bfe20 main + 952
42  dyld                                0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG:  background worker "parallel worker" (PID 32355) was terminated by signal 6: Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL:  Failed process was running: 
select count(*) from join_foo
          left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using 
(id)) ss

I'm yet to see the Assert triggered on the buildfarm, but this one looks
interesting too.

(I can share the complete patch + script for such testing, if it can be 
helpful.)

Best regards,
Alexander Lakhin
Neon (https://neon.tech)


Reply via email to