On Sat, Jan 9, 2021 at 2:07 PM Jeff Janes <jeff.ja...@gmail.com> wrote:

>
> How are you monitoring the COMMIT times?  What do you generally see in
> pg_stat_activity.wait_event during the spikes/stalls?
>

Right now we just observe the COMMIT duration posted in the postgresql log
(we log anything over 100ms).

One other thing that I shamefully forgot to mention. When we see these slow
COMMITs in the log, they coincide with a connection storm (Cat 5 hurricane)
from our apps where connections will go from ~200 to ~1200. This will
probably disgust many, but our PG server's max_connections is set to 2000.
We have a set of pgbouncers in front of this with a total
max_db_connections of 1600. I know many of you think this defeats the whole
purpose of having pgbouncer and I agree. I've been trying to explain as
much and that even with 32 CPUs on this DB host, we probably shouldn't
expect to be able to support more than 100-200 active connections, let
alone 1600. I'm still pushing to have our app server instances (which also
use their own JDBC (Hikari) connection pool and *then* go through
pgbouncer) to lower their min/max connection settings but obviously it's
sort of counterintuitive at first glance but hopefully everyone sees the
bigger picture.

One nagging question I have is if the slow COMMIT is triggering the
connection storm (eg app sees slow response or timeout from a current
connection and fires off a new connection in its place), or vice-versa.
We're planning to deploy new performant cloud storage (Azure Ultra disk)
just for WAL logs but I'm hesitant to say it'll be a silver bullet when we
still have this insane connection management strategy in place.

Curious to know what others think (please pull no punches) and if others
have been in a similar scenario with anecdotes to share.

Thanks,
Don.

-- 
Don Seiler
www.seiler.us

Reply via email to