Hi,

Tomas reported off-list a crash he had been seeing in test_aio/002_io_workers (he hit it with a WIP patch applied). While I can't reproduce it as easily as he can, I did hit it after a few hundred iterations of test_aio/002_io_workers.
The one time I hit it I had forgotten to enable core dumps in the relevant terminal, so I don't have a core dump yet. Tomas had reported this:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000000000095e638 in SetLatch (latch=0x0) at latch.c:304
304         if (latch->is_set)
(gdb) bt
#0  0x000000000095e638 in SetLatch (latch=0x0) at latch.c:304
#1  0x000000000093ed0f in IoWorkerMain (startup_data=0x0, startup_data_len=0) at method_worker.c:499
#2  0x00000000008ada1e in postmaster_child_launch (child_type=B_IO_WORKER, child_slot=230, startup_data=0x0, startup_data_len=0, client_sock=0x0) at launch_backend.c:290
#3  0x00000000008b4011 in StartChildProcess (type=B_IO_WORKER) at postmaster.c:3973
#4  0x00000000008b4848 in maybe_adjust_io_workers () at postmaster.c:4404
#5  0x00000000008b0bee in PostmasterMain (argc=4, argv=0x115570a0) at postmaster.c:1382
#6  0x0000000000765082 in main (argc=4, argv=0x115570a0) at main.c:227

Staring at the code for a while I think I see the problem: If a worker was idle at the time it exited, it is not removed from the set of idle workers. Which in turn means that pgaio_choose_idle_worker() in

    /* Got one.  Clear idle flag. */
    io_worker_control->idle_worker_mask &= ~(UINT64_C(1) << MyIoWorkerId);

    /* See if we can wake up some peers. */
    nwakeups = Min(pgaio_worker_submission_queue_depth(),
                   IO_WORKER_WAKEUP_FANOUT);
    for (int i = 0; i < nwakeups; ++i)
    {
        if ((worker = pgaio_choose_idle_worker()) < 0)
            break;
        latches[nlatches++] = io_worker_control->workers[worker].latch;
    }

can return a worker that's actually not currently running and thus does not have a latch set.

I suspect the reason that this was hit with Tomas' patch is that it adds use of streaming reads to index scans, and thus makes it plausible at all to hit AIO in the path. Tomas reported a 25% failure rate, but for me, even with his patch applied, it is more like 0.1% or so. Not sure what explains that.
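
To make the suspected sequence concrete, here is a small standalone model of it. This is only a sketch and not the PostgreSQL code: every model_* name and the Model* structs are made up for illustration, and the slot layout is assumed from the excerpt above (an idle bitmask plus a per-slot latch pointer that is NULL when no worker occupies the slot).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_WORKERS 32

/* Hypothetical stand-ins for the real latch / worker control structs. */
typedef struct ModelLatch
{
    bool        is_set;
} ModelLatch;

typedef struct ModelWorkerControl
{
    uint64_t    idle_worker_mask;               /* bit N set => worker N idle */
    ModelLatch *latches[MODEL_MAX_WORKERS];     /* NULL => slot not in use */
} ModelWorkerControl;

ModelLatch  model_latch_storage[MODEL_MAX_WORKERS];

/* A worker starts up and publishes its latch in its slot. */
void
model_worker_start(ModelWorkerControl *c, int id)
{
    c->latches[id] = &model_latch_storage[id];
}

/* A worker found nothing to do and marks itself idle. */
void
model_worker_go_idle(ModelWorkerControl *c, int id)
{
    c->idle_worker_mask |= UINT64_C(1) << id;
}

/*
 * A worker exits and releases its slot. With clear_idle_bit = false the
 * idle bit is left behind, which is the suspected bug.
 */
void
model_worker_exit(ModelWorkerControl *c, int id, bool clear_idle_bit)
{
    if (clear_idle_bit)
        c->idle_worker_mask &= ~(UINT64_C(1) << id);
    c->latches[id] = NULL;
}

/* Pick the lowest-numbered idle worker and clear its idle bit, or return -1. */
int
model_choose_idle_worker(ModelWorkerControl *c)
{
    for (int id = 0; id < MODEL_MAX_WORKERS; id++)
    {
        if (c->idle_worker_mask & (UINT64_C(1) << id))
        {
            c->idle_worker_mask &= ~(UINT64_C(1) << id);
            return id;
        }
    }
    return -1;
}

/*
 * Run the sequence: start, go idle, exit, then let a peer try to wake
 * somebody up. Returns true if the peer would end up with a NULL latch.
 */
bool
model_wakeup_hits_null_latch(bool clear_idle_bit)
{
    ModelWorkerControl c = {0};
    int         worker;

    model_worker_start(&c, 0);
    model_worker_go_idle(&c, 0);
    model_worker_exit(&c, 0, clear_idle_bit);

    worker = model_choose_idle_worker(&c);
    return worker >= 0 && c.latches[worker] == NULL;
}
```

In this model, model_wakeup_hits_null_latch(false) returns true, i.e. the waking peer is handed a slot whose latch is NULL, which is what a SetLatch(latch=0x0) call would look like. With model_wakeup_hits_null_latch(true) the exiting worker clears its own bit first, nothing is chosen, and the function returns false.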

Maybe 002_io_workers should run AIO while starting / stopping workers to make it easier to hit problems like that?

The fix seems relatively clear - unset the bit in pgaio_worker_die(). And add an assertion in pgaio_choose_idle_worker() ensuring that the worker "slot" is in use.

Greetings,

Andres Freund