The Problem
While using the following settings to clean up some nodes that I had terminated, I ran into an extreme slowdown in performance.

node-ttl = 12h
node-purge-ttl = 4h
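For context, a node becomes eligible for purging once its deactivated or expired timestamp is older than node-purge-ttl. As a rough SQL sketch of that condition (the 4-hour interval is taken from the settings above, the columns from the certnames schema below):

```sql
-- Sketch: certnames eligible for purge with node-purge-ttl = 4h
SELECT certname
FROM certnames
WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)
  AND (deactivated < now() - interval '4 hours'
       OR expired < now() - interval '4 hours');
```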
I traced it back to multiple DELETE statements that were running, which I believe were coming from the foreign-key constraints on the certnames table.
pe-puppetdb=# \d+ certnames
                                                        Table "public.certnames"
      Column      |           Type           |                       Modifiers                        | Storage  | Stats target | Description
------------------+--------------------------+--------------------------------------------------------+----------+--------------+-------------
 id               | bigint                   | not null default nextval('certname_id_seq'::regclass)  | plain    |              |
 certname         | text                     | not null                                               | extended |              |
 latest_report_id | bigint                   |                                                        | plain    |              |
 deactivated      | timestamp with time zone |                                                        | plain    |              |
 expired          | timestamp with time zone |                                                        | plain    |              |
Indexes:
    "certnames_transform_pkey" PRIMARY KEY, btree (id)
    "certnames_transform_certname_key" UNIQUE CONSTRAINT, btree (certname)
Foreign-key constraints:
    "certnames_reports_id_fkey" FOREIGN KEY (latest_report_id) REFERENCES reports(id) ON DELETE SET NULL
Referenced by:
    TABLE "catalogs" CONSTRAINT "catalogs_certname_fkey" FOREIGN KEY (certname) REFERENCES certnames(certname) ON DELETE CASCADE
    TABLE "factsets" CONSTRAINT "factsets_certname_fk" FOREIGN KEY (certname) REFERENCES certnames(certname) ON UPDATE CASCADE ON DELETE CASCADE
    TABLE "reports" CONSTRAINT "reports_certname_fkey" FOREIGN KEY (certname) REFERENCES certnames(certname) ON DELETE CASCADE
Options: autovacuum_vacuum_scale_factor=0.20, autovacuum_analyze_scale_factor=0.10
My understanding is that instead of bulk-deleting the reports that belong to certnames that are no longer active, we simply delete reports, catalogs, and factsets via a cascading delete from certnames. This means that if I have 200 certnames to delete, I think PostgreSQL will simply shoot out 200 queries to delete the reports, each looking something like

DELETE FROM reports WHERE certname = $1;

and another 200 queries to delete catalogs, and another 200 to delete factsets.
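Given the foreign keys on certnames shown above, the per-row work done by the cascade triggers would look roughly like the following (a sketch of the effect; internally PostgreSQL runs pre-planned parameterized statements per deleted row, but the shape is the same):

```sql
-- Fired once per certname row deleted:
DELETE FROM reports  WHERE certname = $1;   -- reports_certname_fkey
DELETE FROM catalogs WHERE certname = $1;   -- catalogs_certname_fkey
DELETE FROM factsets WHERE certname = $1;   -- factsets_certname_fk
```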
I think I proved that to myself by getting the query plan for deleting two certnames:
pe-puppetdb=# EXPLAIN ANALYZE DELETE FROM certnames WHERE certname IN ('mono-scale-agent-61.us-west-2.compute.internal', 'mono-scale-agent-60.us-west-2.compute.internal');
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Delete on certnames  (cost=8.84..16.45 rows=2 width=6) (actual time=0.116..0.116 rows=0 loops=1)
   ->  Bitmap Heap Scan on certnames  (cost=8.84..16.45 rows=2 width=6) (actual time=0.050..0.053 rows=2 loops=1)
         Recheck Cond: (certname = ANY ('{mono-scale-agent-61.us-west-2.compute.internal,mono-scale-agent-60.us-west-2.compute.internal}'::text[]))
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on certnames_transform_certname_key  (cost=0.00..8.84 rows=2 width=0) (actual time=0.045..0.045 rows=2 loops=1)
               Index Cond: (certname = ANY ('{mono-scale-agent-61.us-west-2.compute.internal,mono-scale-agent-60.us-west-2.compute.internal}'::text[]))
 Planning time: 0.171 ms
 Trigger for constraint catalogs_certname_fkey on certnames: time=0.711 calls=2
 Trigger for constraint factsets_certname_fk on certnames: time=0.468 calls=2
 Trigger for constraint reports_certname_fkey on certnames: time=4063.079 calls=2
 Trigger for constraint catalog_resources_catalog_id_fkey on catalogs: time=159.127 calls=2
 Trigger for constraint factset_id_fk on factsets: time=1.617 calls=2
 Trigger for constraint certnames_reports_id_fkey on reports: time=491.102 calls=154
 Trigger for constraint resource_events_report_id_fkey on reports: time=197.813 calls=154
 Execution time: 4914.278 ms
(15 rows)
I first made sure those certnames existed and checked that they had reports.
pe-puppetdb=# select count(*) from reports where certname = 'mono-scale-agent-60.us-west-2.compute.internal';
 count
-------
    77
(1 row)

pe-puppetdb=# select count(*) from reports where certname = 'mono-scale-agent-61.us-west-2.compute.internal';
 count
-------
    77
(1 row)
Assuming the DELETE that the constraint fires looks like the one below, we can see that deleting from reports also cascades to another two tables (back to certnames, and also to resource_events):
pe-puppetdb=# EXPLAIN ANALYZE DELETE FROM reports WHERE certname = 'mono-scale-agent-62.us-west-2.compute.internal';
                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Delete on reports  (cost=0.55..189.25 rows=46 width=6) (actual time=49.588..49.588 rows=0 loops=1)
   ->  Index Scan using reports_certname_idx on reports  (cost=0.55..189.25 rows=46 width=6) (actual time=0.064..0.423 rows=77 loops=1)
         Index Cond: (certname = 'mono-scale-agent-62.us-west-2.compute.internal'::text)
 Planning time: 0.068 ms
 Trigger for constraint certnames_reports_id_fkey: time=179.471 calls=77
 Trigger for constraint resource_events_report_id_fkey: time=333.978 calls=77
 Execution time: 563.123 ms
(7 rows)
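The two trigger lines in that plan are the second level of fan-out: certnames_reports_id_fkey is declared ON DELETE SET NULL, and resource_events presumably references reports(id) with ON DELETE CASCADE. Per deleted report, the effective statements would be roughly the following (a sketch of what the FK triggers do, one call per deleted row, which matches calls=77; the report_id column name is assumed from the constraint name):

```sql
-- Fired once per report row deleted:
UPDATE certnames SET latest_report_id = NULL
 WHERE latest_report_id = $1;                       -- certnames_reports_id_fkey (SET NULL)
DELETE FROM resource_events WHERE report_id = $1;   -- resource_events_report_id_fkey
```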
While all of these execute quickly now, that's likely because I currently have more memory than the size of the database. I can see how all of these small queries would grind to a halt if they had to go to disk for any reason.
Suggestion:
Remove all cascading DELETEs and UPDATEs and instead opt for one-time bulk queries that resolve the issues the constraints are meant to address.
In some cases, such as reports and resource_events, the situation would resolve itself when the reports are deleted via report-ttl, and that would be an efficient bulk deletion.
We could also move to a delete that looks something like this:

DELETE FROM reports
WHERE reports.id IN ( SELECT reports.id
                      FROM reports
                      JOIN certnames
                        ON reports.certname = certnames.certname
                      WHERE (certnames.deactivated IS NOT NULL OR certnames.expired IS NOT NULL)
                        AND (certnames.deactivated < now() - node_purge_ttl
                             OR certnames.expired < now() - node_purge_ttl) );
We would need to create deletes for catalogs, resource_events and other tables that currently have cascading delete.
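For example, the catalogs and factsets bulk deletes could mirror the reports query above; a sketch (node_purge_ttl again stands in for the configured interval, and since both tables reference certnames(certname) directly, no join through reports is needed):

```sql
DELETE FROM catalogs
WHERE certname IN ( SELECT certname
                    FROM certnames
                    WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)
                      AND (deactivated < now() - node_purge_ttl
                           OR expired < now() - node_purge_ttl) );

DELETE FROM factsets
WHERE certname IN ( SELECT certname
                    FROM certnames
                    WHERE (deactivated IS NOT NULL OR expired IS NOT NULL)
                      AND (deactivated < now() - node_purge_ttl
                           OR expired < now() - node_purge_ttl) );
```

Note that these would themselves still fire the child-table cascades seen in the query plan (catalog_resources_catalog_id_fkey and factset_id_fk) unless those constraints are reworked as well.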
Other Thoughts
I think this is likely uncommon, as I'm not sure how many people use node_purge_ttl, and those who do probably don't have hundreds of nodes to purge at a time. Even so, I think there are benefits to moving to bulk deletes and updates (or, in some cases, doing nothing at all) instead of letting these cascading deletes and updates happen, as they generate a lot of small queries that can back up when the database is busy.