Hello Thomas,
24.05.2025 14:42, Thomas Munro wrote:
On Sat, May 24, 2025 at 3:17 PM Tom Lane <t...@sss.pgh.pa.us> wrote:
So it seems that "very low-probability issue in our Mac AIO code" is
the most probable description.
There isn't any macOS-specific AIO code so my first guess would be
that it might be due to aarch64 weak memory reordering (though Andres
speculated that itt should all be one backend, huh), if it's not just
a timing luck thing. Alexander, were the other OSes you tried all on
x86?
As I wrote off-list before, I had tried x86_64 only, but since then I
tried to reproduce the issue on an aarch64 server with Ubuntu 24.04,
running 10, then 40 instances of t/027_stream_regress.pl in parallel. I've
also multiplied "test: brin ..." line x10. But the issue is still not
reproduced (in 8+ hours).
However, I've managed to get an AIO-related assertion failure on macOS 14.5
(Darwin Kernel Version 23.5.0) quite easily, running 027_stream_regress.pl
the same way:
2 # +++ tap check in src/test/recovery_2 +++
3 # +++ tap check in src/test/recovery_3 +++
4 # +++ tap check in src/test/recovery_4 +++
5 # +++ tap check in src/test/recovery_5 +++
1 # +++ tap check in src/test/recovery_1 +++
7 # +++ tap check in src/test/recovery_7 +++
6 # +++ tap check in src/test/recovery_6 +++
8 # +++ tap check in src/test/recovery_8 +++
1 [16:07:31] t/027_stream_regress.pl ..
1 Dubious, test returned 29 (wstat 7424, 0x1d00)
1 All 1 subtests passed
1 [16:07:31]
1
1 Test Summary Report
1 -------------------
1 t/027_stream_regress.pl (Wstat: 7424 Tests: 1 Failed: 0)
1 Non-zero exit status: 29
1 Parse errors: No plan found in TAP output
1 Files=1, Tests=1, 197 wallclock secs ( 0.01 usr 0.01 sys + 4.47 cusr
5.79 csys = 10.28 CPU)
1 Result: FAIL
TRAP: failed Assert("ioh->op == PGAIO_OP_INVALID"), File: "aio_io.c", Line:
161, PID: 32355
0 postgres 0x0000000104f078f4 ExceptionalCondition
+ 236
1 postgres 0x0000000104c0ebd4
pgaio_io_before_start + 260
2 postgres 0x0000000104c0ea94 pgaio_io_start_readv
+ 36
3 postgres 0x0000000104c2d4e8 FileStartReadV + 252
4 postgres 0x0000000104c807c8 mdstartreadv + 668
5 postgres 0x0000000104c83db0 smgrstartreadv + 116
6 postgres 0x0000000104c1675c AsyncReadBuffers +
1444
7 postgres 0x0000000104c15868 StartReadBuffersImpl
+ 1196
8 postgres 0x0000000104c153ac StartReadBuffers + 64
9 postgres 0x0000000104c129b4
read_stream_start_pending_read + 1204
10 postgres 0x0000000104c12018
read_stream_look_ahead + 812
11 postgres 0x0000000104c11cd0
read_stream_next_buffer + 1396
12 postgres 0x0000000104606c38
heap_fetch_next_buffer + 284
13 postgres 0x00000001045f79d4 heapgettup_pagemode
+ 192
14 postgres 0x00000001045f7fa4 heap_getnextslot + 84
15 postgres 0x000000010497a404
table_scan_getnextslot + 340
16 postgres 0x000000010497a210 SeqNext + 148
17 postgres 0x000000010497a8e4 ExecScanFetch + 772
18 postgres 0x000000010497a4b8 ExecScanExtended +
164
19 postgres 0x0000000104979bf0
ExecSeqScanWithProject + 244
20 postgres 0x0000000104953ce0 ExecProcNode + 68
21 postgres 0x000000010494f148 MultiExecPrivateHash
+ 64
22 postgres 0x000000010494ed48 MultiExecHash + 100
23 postgres 0x0000000104925a18 MultiExecProcNode +
180
24 postgres 0x0000000104957220 ExecHashJoinImpl +
628
25 postgres 0x00000001049566dc ExecHashJoin + 32
26 postgres 0x0000000104925958 ExecProcNodeFirst +
100
27 postgres 0x000000010491c08c ExecProcNode + 68
28 postgres 0x000000010491692c ExecutePlan + 268
29 postgres 0x000000010491679c standard_ExecutorRun
+ 504
30 pg_stat_statements.dylib 0x00000001055e132c pgss_ExecutorRun +
212
31 postgres 0x0000000104916580 ExecutorRun + 72
32 postgres 0x000000010491f59c ParallelQueryMain +
508
33 postgres 0x00000001046968d4 ParallelWorkerMain +
1712
34 postgres 0x0000000104b476f4 BackgroundWorkerMain
+ 824
35 postgres 0x0000000104b4b87c
postmaster_child_launch + 492
36 postgres 0x0000000104b56610
StartBackgroundWorker + 416
37 postgres 0x0000000104b51378
maybe_start_bgworkers + 552
38 postgres 0x0000000104b540cc
LaunchMissingBackgroundProcesses + 1316
39 postgres 0x0000000104b51650 ServerLoop + 616
40 postgres 0x0000000104b4fef0 PostmasterMain + 6632
41 postgres 0x00000001049bfe20 main + 952
42 dyld 0x00000001849d60e0 start + 2360
2025-05-24 16:07:23.427 EDT [22120:4] LOG: background worker "parallel worker" (PID 32355) was terminated by signal 6:
Abort trap: 6
2025-05-24 16:07:23.427 EDT [22120:5] DETAIL: Failed process was running:
select count(*) from join_foo
left join (select b1.id, b1.t from join_bar b1 join join_bar b2 using
(id)) ss
I'm yet to see the Assert triggered on the buildfarm, but this one looks
interesting too.
(I can share the complete patch + script for such testing, if it can be
helpful.)
Best regards,
Alexander Lakhin
Neon (https://neon.tech)