Hi,

While investigating an OOM server crash (on a server with more than 30GB of available memory), I found that it was triggered by a single query.
Related settings:

work_mem = 32MB
hash_mem_multiplier = 1
max_parallel_workers_per_gather = 3

The results below come from an investigation performed on an idle server, with no other activity on it.

Query and EXPLAIN (ANALYZE, COSTS, BUFFERS, TIMING) results:

explain (analyze, costs, buffers, timing)
select *
from employer_manager
WHERE employer_id IN (
    SELECT employer_id
    FROM employer e
    WHERE EXISTS (
        SELECT 1
        FROM employer_service es
        JOIN account_service_package_entry aspe ON (aspe.service_package_id = es.employer_service_id)
        JOIN account_service_event ase ON (ase.account_service_package_entry_id = aspe.id)
        JOIN vacancy v ON (ase.object_id = v.vacancy_id)
        WHERE v.employer_id = e.employer_id
          AND es.price = 0
    )
);

                                                        QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=12483578.55..12699754.67 rows=4951913 width=183) (actual time=402888.266..541201.083 rows=3456146 loops=1)
   Workers Planned: 3
   Workers Launched: 3
   Buffers: shared hit=10227924, temp read=5404033 written=9746488
   ->  Parallel Hash Semi Join  (cost=12483478.55..12650135.54 rows=1597391 width=183) (actual time=402879.754..403927.356 rows=864036 loops=4)
         Hash Cond: (employer_manager.employer_id = v.employer_id)
         Buffers: shared hit=10227924, temp read=5404033 written=9746488
         ->  Parallel Seq Scan on employer_manager  (cost=0.00..61704.37 rows=3902987 width=183) (actual time=0.015..926.040 rows=3030730 loops=4)
               Buffers: shared hit=226745
         ->  Parallel Hash  (cost=12470696.28..12470696.28 rows=1173692 width=8) (actual time=400745.411..400757.615 rows=315704 loops=4)
               Buckets: 1048576  Batches: 8  Memory Usage: 14432kB
               Buffers: shared hit=10000933, temp read=5185403 written=9531604
               ->  Parallel Hash Semi Join  (cost=5419572.63..12470696.28 rows=1173692 width=8) (actual time=381755.186..400673.065 rows=315704 loops=4)
                     Hash Cond: (e.employer_id = v.employer_id)
                     Buffers: shared hit=10000933, temp read=5185403 written=9527768
                     ->  Parallel Seq Scan on employer e  (cost=0.00..65876.37 rows=2986447 width=4) (actual time=0.015..547.631 rows=2313235 loops=4)
                           Buffers: shared hit=360119
                     ->  Parallel Hash  (cost=4389870.76..4389870.76 rows=94549378 width=4) (actual time=300367.731..300370.699 rows=10515574 loops=4)
                           Buckets: 1048576 (originally 1048576)  Batches: 131072 (originally 512)  Memory Usage: 287136kB
                           Buffers: shared hit=9640814, temp read=4068338 written=7441540
                           ->  Parallel Hash Join  (cost=1522656.32..4389870.76 rows=94549378 width=4) (actual time=113906.860..121650.966 rows=10515574 loops=4)
                                 Hash Cond: (aspe.service_package_id = es.employer_service_id)
                                 Buffers: shared hit=9640814, temp read=3570214 written=3573964
                                 ->  Parallel Hash Join  (cost=1139337.57..3828356.61 rows=132999548 width=8) (actual time=87792.230..102392.631 rows=60107033 loops=4)
                                       Hash Cond: (ase.account_service_package_entry_id = aspe.id)
                                       Buffers: shared hit=8740464, temp read=2645196 written=2648224
                                       ->  Parallel Hash Join  (cost=669912.02..3179027.68 rows=132999548 width=8) (actual time=44539.790..74472.198 rows=60107033 loops=4)
                                             Hash Cond: (ase.object_id = v.vacancy_id)
                                             Buffers: shared hit=7347306, temp read=1656073 written=1657576
                                             ->  Parallel Seq Scan on account_service_event ase  (cost=0.00..2275263.13 rows=172006123 width=8) (actual time=0.011..20798.324 rows=133306300 loops=4)
                                                   Buffers: shared hit=5552019
                                             ->  Parallel Hash  (cost=414267.15..414267.15 rows=23473845 width=8) (actual time=7923.194..7923.195 rows=18198520 loops=4)
                                                   Buckets: 1048576  Batches: 128  Memory Usage: 30528kB
                                                   Buffers: shared hit=1795287, temp written=247932
                                                   ->  Parallel Seq Scan on vacancy v  (cost=0.00..414267.15 rows=23473845 width=8) (actual time=0.011..5221.328 rows=18198520 loops=4)
                                                         Buffers: shared hit=1795287
                                       ->  Parallel Hash  (cost=297333.92..297333.92 rows=15801812 width=8) (actual time=4352.140..4352.141 rows=12280833 loops=4)
                                             Buckets: 1048576  Batches: 128  Memory Usage: 23296kB
                                             Buffers: shared hit=1393158, temp written=167644
                                             ->  Parallel Seq Scan on account_service_package_entry aspe  (cost=0.00..297333.92 rows=15801812 width=8) (actual time=0.015..2520.045 rows=12280833 loops=4)
                                                   Buffers: shared hit=1393158
                                 ->  Parallel Hash  (cost=258846.24..258846.24 rows=11429325 width=4) (actual time=3424.467..3424.468 rows=8908616 loops=4)
                                       Buckets: 1048576  Batches: 64  Memory Usage: 30080kB
                                       Buffers: shared hit=900350, temp written=103252
                                       ->  Parallel Seq Scan on employer_service es  (cost=0.00..258846.24 rows=11429325 width=4) (actual time=0.010..2087.361 rows=8908616 loops=4)
                                             Filter: (price = 0)
                                             Rows Removed by Filter: 3624132
                                             Buffers: shared hit=900350
 Planning:
   Buffers: shared hit=236
 Planning Time: 25.494 ms
 Execution Time: 541359.085 ms

There are no big row misestimations, but at the moments of highest memory usage each parallel worker used over 5GB of RAM (so over 20GB in total for 3 workers + the leader).

Memory context dump of a single worker:

TopMemoryContext: 3357032 total in 16 blocks; 46064 free (16 chunks); 3310968 used
...
  TopPortalContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used
    PortalHoldContext: 24632 total in 2 blocks; 7384 free (0 chunks); 17248 used
    PortalContext: 540672 total in 69 blocks; 8568 free (10 chunks); 532104 used:
      ExecutorState: 1082567568 total in 124072 blocks; 4073608 free (7018 chunks); 1078493960 used
        HashTableContext: 74808 total in 5 blocks; 25008 free (3 chunks); 49800 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 74808 total in 5 blocks; 25056 free (3 chunks); 49752 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 40960 total in 3 blocks; 12328 free (6 chunks); 28632 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 4156883208 total in 124074 blocks; 146264 free (30 chunks); 4156736944 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        HashTableContext: 8192 total in 1 blocks; 4376 free (7 chunks); 3816 used
          HashBatchContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
        ...
        ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  Relcache by OID: 16384 total in 2 blocks; 1424 free (2 chunks); 14960 used
  CacheMemoryContext: 1163704 total in 14 blocks; 197832 free (0 chunks); 965872 used
    relation rules: 24576 total in 6 blocks; 4896 free (1 chunks); 19680 used: pg_settings
    index info: 2048 total in 2 blocks; 528 free (1 chunks); 1520 used: pg_toast_2619_index
    ....
    48 more child contexts containing 91136 total in 83 blocks; 27632 free (45 chunks); 63504 used
  PrivateRefCount: 8192 total in 1 blocks; 2616 free (0 chunks); 5576 used
  MdSmgr: 8192 total in 1 blocks; 3880 free (2 chunks); 4312 used
  LOCALLOCK hash: 32768 total in 3 blocks; 16824 free (8 chunks); 15944 used
  Timezones: 104120 total in 2 blocks; 2616 free (0 chunks); 101504 used
  ErrorContext: 8192 total in 1 blocks; 7928 free (4 chunks); 264 used
Grand total: 5245755200 bytes in 248596 blocks; 4995280 free (7224 chunks); 5240759920 used

Also, during query execution there was a huge number of small temp files:

find ~postgres/13/main/base/pgsql_tmp/ -type f | wc -l
1459383

Two questions:

1) Is this behavior normal/expected?
2) Is there any way to limit the memory usage of such queries, to prevent OOM at random times? (Memory usage of almost 200x work_mem surprised me a lot.)

PS: after a crash, recovery of a database with a few million files in pgsql_tmp takes ages (however, this is fixed in newer versions, so it's not a critical issue).

-- 
Maxim Boguk
Senior Postgresql DBA
https://dataegret.com/

Phone UA: +380 99 143 0000
Phone AU: +61 45 218 5678
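PPS: as a sanity check on where the memory might be going, here is a rough back-of-envelope sketch. It assumes (and this is an assumption on my side, not something taken from the dump) that each spilled batch keeps an in-memory file buffer of roughly one 8kB block, and that both the inner and the outer side of the join need one batch file each. With the hash join exploding to 131072 batches, that alone is:

```python
# Back-of-envelope: per-worker memory held just by batch-file buffers
# when a parallel hash join explodes to 131072 batches, as in the
# "Batches: 131072 (originally 512)" node of the plan above.
# Assumption: ~8 kB of in-memory buffer per spilled batch file,
# one inner + one outer batch file per batch.
batches = 131072
buffer_per_file = 8192   # bytes; assumed per-file buffer size (one 8 kB block)
sides = 2                # inner + outer batch files

per_worker_bytes = batches * sides * buffer_per_file
print(per_worker_bytes / 2**30)  # prints 2.0 (GiB)
```

That is already ~2GiB per worker before counting a single tuple, which is the same order of magnitude as the ~4GB HashTableContext in the dump, so the ~200x work_mem blow-up looks like per-batch bookkeeping rather than the hash table contents themselves.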