Aurora Postgres version 17.4

This application mainly runs 3 queries very frequently-  around 12000 
executions per second across the 3.  Normally the server handles this fine and 
at any given time there are between 0 - 10 active sessions running the queries. 
 The 3 queries are tuned-  doing 3-7 logical reads per execute and run in < .1 
millisecond.

Periodically there will be a huge spike in active sessions up to 400+ sessions 
all on CPU (it can be higher depending on how many app servers are in the mix 
at the time).  This spike will last 2-5 seconds but it causes an increase in 
app response time which is a problem.

We snap info from pg_stat_activity (where state <> ‘idle' ) every 1 second for 
troubleshooting issues and a couple observations from that snap data when this 
spike happens:


  1.
The app sessions will show as wait_event_type and wait_event as null which I 
understand means they are on CPU.
  2.
Some of those sessions will not have queryid populated.  But all the app 
sessions do is execute queries.
  3.
With aurora Postgres they store plan information in pg_stat_activity in planid 
column.  For some sessions the queryid will be populated but the planid will 
not be populated.  According to AWS this can happen if the session has 
completed parsing the query but has not finished plan generation or sessions 
are in the bind phase where parameter binding occurs before final plan 
selection.

With Aurora Postgres we do not have access to the server to look at the unix 
processes but we enabled enhanced monitoring in 1 second intervals which does 
provide a little info on session processes.  One thing that stood out  when the 
active session spike happens if I looked at the process list (it only shows ~ 
95 of the processes) I would see a bunch like this:

  "vss": 640900784,
            "name": "postgres: prcoffrcnsmruser prcoffrcnsmr 
172.29.145.159(53026) BIND",
            "tgid": 26247,
            "parentID": 584,
            "memoryUsedPc": 0.02,
            "cpuUsedPc": 0.06,
            "id": 26247,
            "rss": 30496,
            "vmlimit": “unlimited"

Where they had BIND next to the process info (vs others will show things like 
idle or SELECT).  In this case of the 95 processes shown it may have shown 59 
of them were idle, 2 were SELECT, 2 were PARSE, and 32 were BIND.

So based on these processes showing BIND and pg_stat_activity either not 
showing a queryid or planid if it did show a queryid it seems these sessions 
may be getting hung up parsing the queries or specifically at the step it does 
binding.

Can anyone come up with any theories what may be going on here or is my 
description/explanation above too vague?  The app is using prepared statements. 
 The connection pool is HikariCP which does not support caching prepared 
statements.  The thing that throws me off is so far I’m unable to see anything 
the app is doing to cause this.  I don’t see a spike in query activity.  I 
don’t see a spike in new connections from the pools.  I don’t see anything 
leading up to this other than this spike in sessions out of nowhere.  I don’t 
see any resource issues on the server.  It will run this same load consistently 
throughout the day and the active session spike may happen 0 - 10 times in a 
day.

And ideas would be appreciated.

Reply via email to