When measuring the time to create a connection, it is ~2.3X longer with io_method=io_uring than with io_method=sync (6.9ms vs 3ms), and the postmaster process uses ~3.5X more CPU to create connections.
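For anyone who wants to reproduce the connection-create latency in isolation, something like the following pgbench sketch should work (assumptions: a running cluster and a database named "postgres"; the -C flag makes pgbench open a new connection for every transaction and report the average connection time in its summary):

  # one-time setup: initialize a small pgbench dataset
  pgbench -i -s 1 postgres

  # -C = new connection per transaction, -S = select-only script, so the
  # per-transaction cost is dominated by connection establishment;
  # compare the reported connection time for io_method=sync vs =io_uring
  pgbench -C -S -c 1 -T 30 postgres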
The reproduction case so far is my usage of the Insert Benchmark on a large server with 48 cores. I need to fix the benchmark client -- today it creates ~1000 connections/s to run a monitoring query in between every 100 queries, and the extra latency from connection create makes results worse for one of the benchmark steps. While I can fix the benchmark client to avoid this, I am curious about the extra latency in connection create.

I used "perf record -e cycles -F 333 -g -p $pidof_postmaster -- sleep 30" but I have yet to find a big difference between the reports generated with that for io_method=io_uring vs =sync. It shows that much time is spent in the kernel dealing with the VM (page tables, etc).

The server runs Ubuntu 22.04.4. I compiled the Postgres 18beta1 release from source via:

  ./configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer" --with-lz4 --with-liburing

Output from configure includes:

  checking whether to build with liburing support... yes
  checking for liburing... yes

io_uring support was installed via "sudo apt install liburing-dev" and I have liburing 2.1-2build1. libc is Ubuntu GLIBC 2.35-0ubuntu3.10 and gcc is 11.4.0.

More performance info is here:
https://mdcallag.github.io/reports/25_06_01.pg.all.mem.hetz/all.html#summary

The config files I used only differ WRT io_method:
* io_method=sync - https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10b_c32r128
* io_method=workers - https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10cw4_c32r128
* io_method=io_uring - https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10d_c32r128

The symptoms are:
* ~20% reduction in point queries/s with io_method=io_uring vs =sync, =workers or Postgres 17.4. The issue is not that SELECT performance has changed; it is that my benchmark client sometimes creates connections in between running queries, and the new latency from that with io_method=io_uring hurts throughput.
* CPU/query and context switches/query are similar; with io_uring the CPU/query might be ~4% larger.

From sampled thread stacks of the postmaster, when I use io_uring the common stack is:

  arch_fork,__GI__Fork,__libc_fork,fork_process,postmaster_child_launch,BackendStartup,ServerLoop,PostmasterMain,main

While the typical stack with io_method=sync is:

  epoll_wait,WaitEventSetWaitBlock,WaitEventSetWait,ServerLoop,PostmasterMain,main

I run "ps" during each benchmark step, and one example of what I see during a point query benchmark step (qp100.L2) with io_method=io_uring is below. The benchmark step runs for 300 seconds.

---> from the start of the step
mdcallag 3762684 0.9 1.5 103027276 2031612 ? Ss 03:12 0:14 /home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

---> from the end of the step
mdcallag 3762684 15.9 1.5 103027276 2031612 ? Rs 03:12 5:04 /home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

And from top I see:

---> with =io_uring
PID     USER     PR NI VIRT  RES  SHR  S %CPU %MEM TIME+   COMMAND
3762684 mdcallag 20 0  98.3g 1.9g 1.9g R 99.4 1.5  3:04.87 /home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

---> with =sync
PID     USER     PR NI VIRT  RES  SHR  S %CPU %MEM TIME+   COMMAND
2913673 mdcallag 20 0  98.3g 1.9g 1.9g S 28.3 1.5  0:54.13 /home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

The postmaster had used 0:14 (14 seconds) of CPU time by the start of the benchmark step and 5:04 (304 seconds) by the end. For the same step with io_method=sync it was 0:05 at the start and 1:27 at the end. So the postmaster used ~290 seconds of CPU with =io_uring vs ~82 with =sync, which is ~3.5X more CPU on the postmaster per connection attempt.
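If it is useful for reproducing this, the per-step postmaster CPU numbers above can be collected with something like this sketch (assumptions: procps ps and that $PGDATA points at the data directory; the 300-second sleep matches the benchmark step length):

  # first line of postmaster.pid is the postmaster PID
  pid=$(head -1 $PGDATA/postmaster.pid)
  t0=$(ps -o times= -p $pid)   # cumulative CPU seconds at the start of the step
  sleep 300                    # run for the length of the benchmark step
  t1=$(ps -o times= -p $pid)   # cumulative CPU seconds at the end of the step
  echo "postmaster CPU seconds during step: $((t1 - t0))"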
From vmstat what I see is that some of the rates (cs = context switches, us = user CPU) are ~20% smaller with =io_uring, which is reasonable given that the throughput is also ~20% smaller. But sy (system CPU) is not 20% smaller because of the overhead from all of those calls to fork (or clone).

Avg rates from vmstat:

  cs      us    sy    us+sy
  492961  25.0  14.0  39.0   ---> with =sync
  401233  20.1  14.0  34.1   ---> with =io_uring

--
Mark Callaghan
mdcal...@gmail.com