pá 22. 8. 2025 v 17:03 odesílatel hubert depesz lubaczewski < dep...@depesz.com> napsal:
> On Fri, Aug 22, 2025 at 07:33:03AM -0700, Adrian Klaver wrote: > > On 8/22/25 05:37, hubert depesz lubaczewski wrote: > > > On Thu, Aug 21, 2025 at 11:17:27AM -0700, Adrian Klaver wrote: > > > > Have you looked at?: > > > > https://www.pgbouncer.org/changelog.html#pgbouncer-124x > > > > To see if anything stands out. > > > > Then there is: > > > > https://www.pgbouncer.org/config.html#max_prepared_statements > > > > The below may also be worth looking at: > > > > https://github.com/pgbouncer/pgbouncer/pull/1144 > > > > I can't help thinking that there is a caching issue at stake, though > that is > > > > just a guess. > > > > > > Will check, but we don't use prepared statements, so this doesn't seem > > > to be part-of-the-problem. > > > > In your OP there is: > > > > "Every now and then (usually every 3-5 minutes, but not through the whole > > day), we see situations where every query suddently takes ~ 1 second. > > > > > > I'm talkign about "prepare" for preparing statements. binds. also > > "discard all". > > " > > > > Is that referring to something else? > > No. Same problem. The thing is that it comes and goes. > > *BUT* > > I got repeatable case today. Is is breaking on its own everyy > ~ 5 minutes. > > So I made test.sql with: > \timing on > SELECT 'SELECT now(), pg_sleep(.05);' FROM generate_series(1,100000) i > \gexec > > and ran it like this: > > psql -p 5432 -d the_db -f test.sql -qAtX | perl -pe 's/\|\s*$/ :: /' > > This bypassed pgbouncer, connected via unix socket, the whole thing got > simpler. > > After 4 minutes of 50ms runtimes I got: > > 2025-08-22 14:44:14.013541+00 :: Time: 50.090 ms > 2025-08-22 14:44:14.063636+00 :: Time: 50.088 ms > 2025-08-22 14:44:14.11373+00 :: Time: 50.085 ms > 2025-08-22 14:44:14.163822+00 :: Time: 50.099 ms > 2025-08-22 14:44:14.213927+00 :: Time: 1393.932 ms (00:01.394) > 2025-08-22 14:44:15.607872+00 :: Time: 50.171 ms > 2025-08-22 14:44:15.658097+00 :: Time: 50.209 ms > 2025-08-22 14:44:15.708353+00 :: Time: 50.212 ms > 2025-08-22 14:44:15.758589+00 :: Time: 50.170 ms > > This is pg_sleep. There are no locks, nothing else. I think this > disproves idea that the problem is about pgbouncer, and it kinda looks > like a thing in pg? > > Interestingly: ran strace on the pg backend that was doing this loop, > like this: > strace -tt -T -ff -p 1266521 -s 256 -o a.log > > And it showed something really strange: > > … > 14:57:00.864684 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), > pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000007> > 14:57:00.864739 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050063> > 14:57:00.914823 sendto(9, > "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 > 14:57:00.86+00\0\0\0\0071266521\0\0\0\0C\0\0\0\14:57:00.914849 recvfrom(9, > "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", > 8192, 0, NULL, NULL) = 68 <0.000007> > 14:57:00.914891 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050064> > 14:57:00.964978 sendto(9, > "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 > 14:57:00.91+00\0\0\0\0071266521\0\0\0\0C\0\0\0\ > 14:57:00.965006 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), > pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000007> > > > 14:57:00.965052 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050063> > 14:57:01.015133 futex(0xffe7a6636538, > FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY) = > 0 <1.354927> > 14:57:02.370180 write(2, "\0\0\36\1YS\23\0T2025-08-22 14:57:02.370 > UTC,\"postgres\",\"the.db\",1266521,\"[local]\",68a884fb.135359,5,\"SELECT\",2025-08-22 > 14:55:55 UTC,85/0,0,LOG,00000,\"duration: 1405.106 ms statement: SELECT > now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05); > 14:57:02.370521 sendto(8, > "\2\0\0\0@\0\0\0\216A\0\0\0\0\0\0\t\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\255\221\33\0\0\0\0\0\262\341\6\0\0\0\0\0\0\0\0\0\0\0\0\0", > 64, 0, NULL, 0) = 64 <0.000041> > 14:57:02.370639 sendto(9, > "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 > 14:57:00.97+00\0\0\0\0071266521\0\0\0\0C\0\0\0\14:57:02.371063 recvfrom(9, > "Q\0\0\0CSELECT now()::timestamptz(2), pg_backend_pid(), pg_sleep(.05);\0", > 8192, 0, NULL, NULL) = 68 <0.000019> > 14:57:02.371215 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050090> > 14:57:02.421381 sendto(9, > "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 > 14:57:02.37+00\0\0\0\0071266521\0\0\0\0C\0\0\0\ > 14:57:02.421480 recvfrom(9, 0xaaaae7b3fc08, 8192, 0, NULL, NULL) = -1 > EAGAIN (Resource temporarily unavailable) <0.000008> > > > 14:57:02.421507 epoll_pwait(4, [{EPOLLIN, {u32=470567512, > u64=187651886697048}}], 1, -1, NULL, 8) = 1 <0.000037> > 14:57:02.421570 recvfrom(9, "Q\0\0\0CSELECT now()::timestamptz(2), > pg_backend_pid(), pg_sleep(.05);\0", 8192, 0, NULL, NULL) = 68 <0.000012> > 14:57:02.421673 epoll_pwait(11, [], 1, 50, NULL, 8) = 0 <0.050099> > 14:57:02.471866 sendto(9, > "T\0\0\0X\0\3now\0\0\0\0\0\0\0\0\0\4\240\0\10\0\0\0\2\0\0pg_backend_pid\0\0\0\0\0\0\0\0\0\0\27\0\4\377\377\377\377\0\0pg_sleep\0\0\0\0\0\0\0\0\0\10\346\0\4\377\377\377\377\0\0D\0\0\0002\0\3\0\0\0\0312025-08-22 > 14:57:02.42+00\0\0\0\0071266521\0\0\0\0C\0\0\0\14:57:02.471961 recvfrom(9, > 0xaaaae7b3fc08, 8192, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily > unavailable) <0.000008> > … > > As in, normal execution is: recv from, epoll_pwait that takes 0.05s, and > sendto. > all clear, nothing fancy. > > *BUT* > in case of this one problematic thing there is futex() call that took > 1.354s ?! Checked, and in straced ~ 800 queries this is the only futex > call() ?! > > Any idea on what could it be? > Few years ago I had a similar experience - some sec mystic lags - the problem was in virtualization. Regards Pavel > > Best regards, > > depesz > > > >