Hi hackers! Recently I observed a very peculiar incident.
=== Incident description ===
An ETL database had been operating fine for many months and was regularly
updated. The workload was not changing much, but since it was an ETL database,
most queries were different all the time.
On the night of September 7th the database got stuck: no monitoring query could
be executed. DBAs started to deal with the incident, but there is not much you
can do with a database service when you cannot execute a single query. According
to VM metrics, the VM was writing a lot to disk.
The on-call engineer was summoned. He observed a lot of backends stuck with a
similar backtrace:
#5 LWLockAcquire()
#6 pgss_store()
#7 pgss_post_parse_analyze()
#8 parse_analyze()
#9 pg_analyze_and_rewrite()
After a restart, the problem reproduced within 50 minutes. But this time
monitoring queries kept working, which showed that all backends were stuck on
the LWLock "pg_stat_statements". It was impossible to disable pgss with SQL, so
the engineer altered postgresql.auto.conf and restarted the database. This
resolved the incident.
Later I analyzed the incident. Enabling pgss again showed traces of the
problem:
Fri 09 Sep 2022 08:52:31 AM MSK (every 2s)
usename | state | wait_event | cnt
-----------------+---------------------+--------------------+-----
content | active | DataFileRead | 1
content | active | pg_stat_statements | 42
content | idle in transaction | ClientRead | 2
pgwatch_monitor | active | BufFileRead | 1
pgwatch_monitor | active | [null] | 5
pgwatch_monitor | active | pg_stat_statements | 85
postgres | active | [null] | 1
repl | active | WalSenderMain | 2
[null] | active | [null] | 2
[null] | active | VacuumDelay | 7
(10 rows)
pgwatch was querying 60 databases every minute, and each call to
pg_stat_statements() took approximately 3-4 seconds.
The backend holding the lock everyone was queued behind looked like this in
pg_stat_activity:
datid | 127782
pid | 4077900
usename | pgwatch_monitor
application_name | pgwatch2 - 10.96.17.68
wait_event_type | IO
wait_event | BufFileWrite
state | active
backend_xid | [null]
backend_xmin | 67048029
The contents of the pg_stat_statements view overran work_mem and were
materialized in a tuplestore on disk. This is what caused a lot of disk writes
on a database that was not accepting any user queries.
=== End of incident description ===
TLDR: LWLock "pg_stat_statements" disabled all SQL queries.
=== How the problem develops ===
Prerequisite 1. pgwatch is querying pgss frequently.
Prerequisite 2. pgss grows big enough that the tuplestore is written to disk
while the shared lock is held.
Prerequisite 3. Someone calls reset(), or pgss_store() needs an exclusive lock
for a new statement.
Consequence. The exclusive lock request queues behind the long-held shared lock
and prevents all subsequent shared locks from being taken.
Result. Any query that goes through the pgss hooks hangs.
=== Reproduction for development purposes ===
0. Set up a database with pg_stat_statements.track = all.
1. Modify pg_stat_statements_internal() to wait for a long time under
LWLockAcquire(pgss->lock, LW_SHARED).
2. select * from pg_stat_statements()
3. select pg_stat_statements_reset()
Now the database is bonked. Any query will hang until pg_stat_statements()
finishes.
=== How to fix ===
pgss uses LWLock to protect hashtable.
When the hashtable is reset or new user query is inserted in pgss_store() -
exclusive lock is used.
When stats are updated for known query or pg_stat_statements are read - shared
lock is used.
I propose to swap shared lock for stats update to conditional shared lock.
It cannot be taken during reset() - and that's totally fine. It cannot be taken
during entering new query - and I think it's OK to spill some stats in this
case. PFA patch attached.
This patch prevents from a disaster incurred by described here locking.
=== Other possible mitigation ===
The incident would not have occurred if the tuplestore had not spilled to its
on-disk representation. But I don't see how to reliably protect against this.
What do you think?
Thanks!
Best regards, Andrey Borodin.
v1-0001-Demonstrate-and-fix-lock-of-all-SQL-queries-by-pg.patch
Description: Binary data
