When measuring the time to create a connection, it is ~2.3X longer with
io_method=io_uring than with io_method=sync (6.9ms vs 3ms), and the
postmaster process uses ~3.5X more CPU to create connections.
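For anyone trying to reproduce the timing, a minimal sketch for measuring the average cost of repeated connection attempts is below. The psql flags are placeholders; point them at your server.

```shell
# time N connection attempts and report the average in milliseconds;
# uses GNU date (nanosecond resolution), as on Ubuntu
avg_connect_ms() {
  n=$1; shift
  start=$(date +%s%N)
  for _ in $(seq "$n"); do "$@" >/dev/null 2>&1; done
  end=$(date +%s%N)
  awk -v s="$start" -v e="$end" -v n="$n" \
      'BEGIN { printf "%.2f ms per connection\n", (e - s) / 1e6 / n }'
}

# usage: avg_connect_ms 200 psql -h 127.0.0.1 -p 5432 -c 'SELECT 1'
```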

The reproduction case so far is my usage of the Insert Benchmark on a large
server with 48 cores. I need to fix the benchmark client -- today it
creates ~1000 connections/s to run a monitoring query in between every 100
queries and the extra latency from connection create makes results worse
for one of the benchmark steps. While I can fix the benchmark client to
avoid this, I am curious about the extra latency in connection create.

I used "perf record -e cycles -F 333 -g -p $pidof_postmaster -- sleep 30"
but I have yet to find a big difference from the reports generated with
that for io_method=io_uring vs =sync. It shows that much time is spent in
the kernel dealing with the VM (page tables, etc).

The server runs Ubuntu 22.04.4. I compiled the Postgres 18beta1 release
from source via:
./configure --prefix=$pfx --enable-debug CFLAGS="-O2
-fno-omit-frame-pointer" --with-lz4  --with-liburing

Output from configure includes:
checking whether to build with liburing support... yes
checking for liburing... yes

io_uring support was installed via "sudo apt install liburing-dev", and I
have version 2.1-2build1
libc is Ubuntu GLIBC 2.35-0ubuntu3.10
gcc is 11.4.0

More performance info is here:
https://mdcallag.github.io/reports/25_06_01.pg.all.mem.hetz/all.html#summary

The config files I used differ only in io_method:
* io_method=sync -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10b_c32r128
* io_method=workers -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10cw4_c32r128
* io_method=io_uring -
https://github.com/mdcallag/mytools/blob/master/bench/conf/arc/may25.hetzner/pg18b1git_o2nofp/conf.diff.cx10d_c32r128

The symptoms are:
* ~20% reduction in point queries/s with io_method=io_uring vs =sync,
=workers, or Postgres 17.4. The issue is not that SELECT performance has
changed; it is that my benchmark client sometimes creates connections in
between running queries, and the extra connection-create latency with
io_method=io_uring hurts throughput
* CPU/query and context switches/query are similar; with io_uring the
CPU/query might be ~4% larger

From sampled thread stacks of the postmaster when I use io_uring, the
common stack is:
arch_fork,__GI__Fork,__libc_fork,fork_process,postmaster_child_launch,BackendStartup,ServerLoop,PostmasterMain,main

While the typical stack with io_method=sync is:
epoll_wait,WaitEventSetWaitBlock,WaitEventSetWait,ServerLoop,PostmasterMain,main

I run "ps" during each benchmark step; an example of what I see during a
point query benchmark step (qp100.L2) with io_method=io_uring is below. The
benchmark step runs for 300 seconds.
---> from the start of the step
mdcallag 3762684  0.9  1.5 103027276 2031612 ?   Ss   03:12   0:14
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg
---> from the end of the step
mdcallag 3762684 15.9  1.5 103027276 2031612 ?   Rs   03:12   5:04
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

And from top I see:
---> with =io_uring
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+
COMMAND
3762684 mdcallag  20   0   98.3g   1.9g   1.9g R  99.4   1.5   3:04.87
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

---> with =sync
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+
COMMAND
2913673 mdcallag  20   0   98.3g   1.9g   1.9g S  28.3   1.5   0:54.13
/home/mdcallag/d/pg18beta1_o2nofp/bin/postgres -D /data/m/pg

The postmaster had used 0:14 (14 seconds) of CPU time by the start of the
benchmark step and 5:04 (304 seconds) by the end. For the same step with
io_method=sync it was 0:05 at the start and 1:27 at the end. So the
postmaster used ~290 seconds of cpu with =io_uring vs ~82 with =sync, which
is ~3.5X more CPU on the postmaster per connection attempt.
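For clarity, the arithmetic on the ps TIME values can be sketched like this, converting mm:ss to seconds (the numbers are the ones quoted above):

```shell
# convert ps TIME output (mm:ss or hh:mm:ss) to seconds
to_secs() {
  echo "$1" | awk -F: '{ s = 0; for (i = 1; i <= NF; i++) s = s * 60 + $i; print s }'
}

# postmaster CPU consumed during the 300s step = end - start
uring_cpu=$(( $(to_secs 5:04) - $(to_secs 0:14) ))   # 290 seconds
sync_cpu=$((  $(to_secs 1:27) - $(to_secs 0:05) ))   # 82 seconds
awk -v a="$uring_cpu" -v b="$sync_cpu" 'BEGIN { printf "%.1fX\n", a / b }'   # ~3.5X
```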

From vmstat, some of the rates (cs = context switches, us = user CPU) are
~20% smaller with =io_uring, which is reasonable given that the throughput
is also ~20% smaller. But sy (system CPU) is not 20% smaller, because of
the overhead from all of those calls to fork (or clone).

Avg rates from vmstat
cs      us      sy      us+sy
492961  25.0    14.0    39.0   ---> with =sync
401233  20.1    14.0    34.1   ---> with =io_uring
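A sketch of how such averages can be computed from raw vmstat output, assuming the default column layout where cs/us/sy are fields 12-14:

```shell
# average the cs (context switches), us (user CPU) and sy (system CPU)
# columns of "vmstat 1" output, skipping the two header lines
avg_vmstat() {
  awk 'NR > 2 { cs += $12; us += $13; sy += $14; n++ }
       END { if (n) printf "cs=%.0f us=%.1f sy=%.1f us+sy=%.1f\n",
                    cs / n, us / n, sy / n, (us + sy) / n }'
}

# usage: vmstat 1 300 | avg_vmstat
```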

-- 
Mark Callaghan
mdcal...@gmail.com
