For weeks now I have been banging my head against an "out of memory" situation.
There is only one query I am running on an 8 GB system, and whatever I try,
I get knocked out by this out-of-memory error. It is extremely hard to
understand and fix. I guess I could add a swap file, but then I would have
to take the penalty of swapping. And how can I actually address an
out-of-memory condition if the system doesn't tell me where it is
happening?
You might want to see the query, but it is a huge plan, and I can't really
break it down. It shouldn't matter, though. Just so you can get a glimpse,
here is the plan:
Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid,
                documentinformationsubject.is_current, documentinformationsubject.documentid,
                documentinformationsubject.documenttypecode,
                documentinformationsubject.subjectroleinternalid,
                documentinformationsubject.subjectentityinternalid,
                documentinformationsubject.subjectentityid,
                documentinformationsubject.subjectentityidroot,
                documentinformationsubject.subjectentityname,
                documentinformationsubject.subjectentitytel,
                documentinformationsubject.subjectentityemail,
                documentinformationsubject.otherentityinternalid,
                documentinformationsubject.confidentialitycode,
                documentinformationsubject.actinternalid, documentinformationsubject.code_code,
                documentinformationsubject.code_displayname, q.code_code, q.code_displayname,
                an.extension, an.root, documentinformationsubject_2.subjectentitycode,
                documentinformationsubject_2.subjectentitycodesystem,
                documentinformationsubject_2.effectivetime_low,
                documentinformationsubject_2.effectivetime_high,
                documentinformationsubject_2.statuscode,
                documentinformationsubject_2.code_code, agencyid.extension,
                agencyname.trivialname, documentinformationsubject_1.subjectentitycode,
                documentinformationsubject_1.subjectentityinternalid
              ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 rows=34619 width=1197)
                    Join Filter: (((documentinformationsubject.documentinternalid)::text = (q.documentinternalid)::text) AND ((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))
                    ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
                          Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
                          ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
                                Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                ->  Seq Scan on documentinformationsubject  (cost=0.00..1392681.22 rows=34619 width=882)
                                      Filter: (((participationtypecode)::text = ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND ((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                          ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                                ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
                                      Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
                                      ->  Seq Scan on documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 rows=1010585 width=159)
                                            Filter: ((participationtypecode)::text = 'PRD'::text)
                    ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)
                          ->  Nested Loop Left Join  (cost=0.56..2318944.24 rows=13 width=341)
                                ->  Nested Loop Left Join  (cost=0.00..2318893.27 rows=1 width=281)
                                      Join Filter: ((agencyname.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                      ->  Nested Loop Left Join  (cost=0.00..2286828.33 rows=1 width=291)
                                            Join Filter: ((agencyid.entityinternalid)::text = (documentinformationsubject_2.otherentityinternalid)::text)
                                            ->  Nested Loop Left Join  (cost=0.00..2284826.24 rows=1 width=239)
                                                  Join Filter: (((q.documentinternalid)::text = (documentinformationsubject_2.documentinternalid)::text) AND ((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))
                                                  ->  Nested Loop  (cost=0.00..954957.59 rows=1 width=136)
                                                        Join Filter: ((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                        ->  Seq Scan on actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: ((typecode)::text = 'SUBJ'::text)
                                                        ->  Seq Scan on documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                              Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Seq Scan on documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 rows=1 width=177)
                                                        Filter: ((participationtypecode)::text = 'AUT'::text)
                                            ->  Seq Scan on entity_id agencyid  (cost=0.00..1574.82 rows=34182 width=89)
                                      ->  Seq Scan on bestname agencyname  (cost=0.00..27066.08 rows=399908 width=64)
                                ->  Index Scan using act_id_fkidx on act_id an  (cost=0.56..50.85 rows=13 width=134)
                                      Index Cond: ((q.actinternalid)::text = (actinternalid)::text)
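One thing that can be read straight off a plan like this, as I understand
it: every Sort, Materialize, and Hash node can claim its own
work_mem-sized allocation, so a single query may use several multiples of
work_mem. A quick way to count those nodes (a plain-Python sketch; the
excerpt below is just a few lines of the plan above):

```python
import re

# Each Sort / Materialize / Hash node may hold up to work_mem on its own,
# so the per-query budget is roughly work_mem * (number of such nodes).
MEM_NODES = re.compile(r'->\s+(Sort|Materialize|Hash)\b')

def count_mem_nodes(plan_text: str) -> int:
    """Count plan nodes that can each use up to work_mem."""
    return len(MEM_NODES.findall(plan_text))

plan_excerpt = """\
  ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
        ->  Merge Left Join  (cost=2998334.98..3011313.54 rows=34619 width=930)
              ->  Sort  (cost=1408783.87..1408870.41 rows=34619 width=882)
              ->  Materialize  (cost=1589551.12..1594604.04 rows=1010585 width=159)
                    ->  Sort  (cost=1589551.12..1592077.58 rows=1010585 width=159)
"""
print(count_mem_nodes(plan_excerpt))  # 4
```

On the full plan above that count is 5 (three Sorts and two Materializes),
which is why even a modest work_mem gets multiplied.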
I have monitored the activity with vmstat and iostat, and it looks like
the memory grabbing happens rapidly after a sort/merge step. In iostat I
see heavy read and write activity, which I attribute to the sort-merge
step; that is followed by a sudden spike in write activity, and then the
out-of-memory crash.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu------ -----timestamp-----
 r  b  swpd    free  buff   cache  si  so    bi    bo    in    cs us sy id wa st                 UTC
 0  2     0  119344     0 7616672   0   0 11681  3107     9     0  6  1 72 21  0 2019-04-14 16:19:52
 0  2     0  128884     0 7607288   0   0  2712 55386   500   509  3  2 15 80  0 2019-04-14 16:19:54
 0  2     0  116984     0 7619916   0   0   880 59241   548   525  2  2  9 87  0 2019-04-14 16:19:56
 0  2     0  131492     0 7604816   0   0   128 56512   518   401  1  1 12 86  0 2019-04-14 16:19:58
...
 0  2     0  134508     0 7601480   0   0     0 58562   428   353  0  1  4 95  0 2019-04-14 16:21:46
 0  2     0  125360     0 7611320   0   0     0 59392   481   369  0  1 11 89  0 2019-04-14 16:21:48
 0  2     0  122896     0 7612872   0   0     0 58564   431   342  0  1 17 82  0 2019-04-14 16:21:50
 1  1     0  121456     0 7614248   0   0    54 57347   487   399  0  1 13 85  0 2019-04-14 16:21:52
 0  2     0  122820     0 7613324   0   0    12 59964   460   346  0  1 20 79  0 2019-04-14 16:21:54
 0  2     0  120344     0 7616528   0   0  1844 55691   645   676  5  3  6 85  0 2019-04-14 16:21:56
 0  2     0  124900     0 7611404   0   0   936 58261   795  1215  2  3 13 83  0 2019-04-14 16:21:58
 0  2     0  124572     0 7612192   0   0  1096 55340   518   487  1  2  0 97  0 2019-04-14 16:22:00
 0  2     0  123040     0 7612740   0   0   888 57574   573   620  1  2  5 92  0 2019-04-14 16:22:02
 0  2     0  125112     0 7610592   0   0   124 59164   498   480  1  1 13 85  0 2019-04-14 16:22:04
 1  1     0  129440     0 7607592   0   0   568 60196   563   612  2  2  8 88  0 2019-04-14 16:22:06
 0  2     0  124020     0 7612364   0   0     0 58260   629   725  0  1  8 91  0 2019-04-14 16:22:08
 2  1     0  124480     0 7611848   0   0     0 58852   447   331  0  1  1 98  0 2019-04-14 16:22:10
 0  3     0  137636     0 7598484   0   0 11908 44995   619   714  1  1 11 87  0 2019-04-14 16:22:12
 0  2     0  123128     0 7613392   0   0 29888 28901   532   972  1  1 29 68  0 2019-04-14 16:22:14
 0  2     0  126260     0 7609984   0   0 39872 18836   706  1435  1  2 28 70  0 2019-04-14 16:22:16
 0  2     0  130748     0 7605536   0   0 36096 22488   658  1272  2  1  8 89  0 2019-04-14 16:22:18
...
 0  2     0  127216     0 7609192   0   0 29192 29696   472   949  1  1 23 75  0 2019-04-14 16:22:40
 0  2     0  147428     0 7588556   0   0 29120 29696   523   974  1  1 19 79  0 2019-04-14 16:22:42
 0  1     0  120644     0 7615388   0   0 32320 25276   566   998  1  2 49 47  0 2019-04-14 16:22:44
 0  1     0  128456     0 7607904   0   0 58624     0   621  1103  3  2 49 46  0 2019-04-14 16:22:46
 0  1     0  127836     0 7608260   0   0 58624     0   631  1119  3  2 50 46  0 2019-04-14 16:22:48
 0  1     0  126712     0 7609616   0   0 58624     0   616  1110  2  2 50 47  0 2019-04-14 16:22:50
...
 0  1     0  157408     0 7578060   0   0 58628     0   736  1206  3  3 50 44  0 2019-04-14 16:27:22
 0  1     0  142420     0 7593400   0   0 58688     0   623  1099  1  4 50 45  0 2019-04-14 16:27:24
 0  1     0  247016     0 7488184   0   0 58568     0   649  1113  1  4 50 45  0 2019-04-14 16:27:26
 0  1     0  123232     0 7612088   0   0 58412   215   675  1141  2  3 50 46  0 2019-04-14 16:27:28
 0  2     0  144920     0 7586576   0   0 48376 11046   788  1455  1  5 34 60  0 2019-04-14 16:27:30
 1  1     0  125636     0 7595704   0   0 36736 21381   702  1386  1  4 21 74  0 2019-04-14 16:27:32
 0  3     0  156700     0 7559328   0   0 35556 23364   709  1367  1  3 22 74  0 2019-04-14 16:27:34
 0  2     0  315580     0 7382748   0   0 33608 24731   787  1407  1  5 18 76  0 2019-04-14 16:27:36
...
 0  2     0  684412     0 6152040   0   0 29832 28356   528   994  1  2 32 66  0 2019-04-14 16:38:04
 0  2     0  563512     0 6272264   0   0 29696 29506   546   987  1  2 32 65  0 2019-04-14 16:38:06
 0  2     0  595488     0 6241068   0   0 27292 30858   549   971  1  2 26 71  0 2019-04-14 16:38:08
 0  2     0  550120     0 6285352   0   0 28844 29696   567   995  1  2 29 68  0 2019-04-14 16:38:10
 1  1     0  432380     0 6402964   0   0 28992 29696   557   979  1  2 37 61  0 2019-04-14 16:38:12
 0  2     0  445796     0 6384412   0   0 26768 32134   628  1029  1  4 27 69  0 2019-04-14 16:38:14
 0  2     0  374972     0 6453592   0   0 28172 30839   529   962  1  2 43 54  0 2019-04-14 16:38:16
 0  2     0  317824     0 6507992   0   0 29172 29386   560  1001  1  3 27 68  0 2019-04-14 16:38:18
 0  3     0  215092     0 6609132   0   0 33116 25210   621  1148  1  3 19 77  0 2019-04-14 16:38:20
 0  2     0  194836     0 6621524   0   0 27786 30959   704  1152  0  5 18 77  0 2019-04-14 16:38:22
 0  3     0  315648     0 6500196   0   0 31434 27226   581  1073  0  3 31 65  0 2019-04-14 16:38:24
 0  2     0  256180     0 6554676   0   0 29828 29017   668  1174  0  4 20 76  0 2019-04-14 16:38:26  <<< CRASH
 0  1     0  378220     0 6552496   0   0  4348 53686  2210  3816  1  5 46 49  0 2019-04-14 16:38:28
 0  1     0  389888     0 6536296   0   0  2704 56529  2454  4178  0  5 42 52  0 2019-04-14 16:38:30
 0  2     0  923572     0 5998992   0   0  1612 56863  2384  3928  0  6 16 78  0 2019-04-14 16:38:32
 0  0     0  908336     0 6006696   0   0  3584 49280  8961 17334  0 19 39 42  0 2019-04-14 16:38:34
 0  1     0 1306480     0 5607088   0   0   264 63632 18605 37933  3 58 35  4  0 2019-04-14 16:38:36
 2  1     0 1355448     0 5558576   0   0     8 59222 14817 30296  2 46 24 27  0 2019-04-14 16:38:38
 2  2     0 1358224     0 5555884   0   0     0 58544 14226 28331  2 44  3 50  0 2019-04-14 16:38:40
 2  1     0 1446348     0 5468748   0   0     0 58846 14376 29185  2 44 11 42  0 2019-04-14 16:38:42
 0  0     0 2639648     0 4357608   0   0     0 28486 12909 26770  2 44 49  5  0 2019-04-14 16:38:44
 0  0     0 2639524     0 4357800   0   0     0     0   158   154  0  0 100 0  0 2019-04-14 16:38:46
 0  0     0 2687316     0 4309976   0   0     0     0   181   188  0  2 98  0  0 2019-04-14 16:38:48
 0  0     0 2706920     0 4300116   0   0     0   105   137   263  0  0 100 0  0 2019-04-14 16:38:50
 0  0     0 2706672     0 4300232   0   0     0     0   142   204  0  0 100 0  0 2019-04-14 16:38:52
 0  0     0 2815116     0 4191928   0   0     0     0   116   242  0  0 100 0  0 2019-04-14 16:38:54
 0  0     0 2815364     0 4192008   0   0     0     0   116   239  0  0 100 0  0 2019-04-14 16:38:56
 0  0     0 2815116     0 4192164   0   0     0     0   159   236  0  0 100 0  0 2019-04-14 16:38:58
The output ends shortly after the out-of-memory crash, which occurred
exactly at the marked point, 16:38:26.355 UTC.
I can't really see anything too worrisome here. There is always lots of
memory used by cache, which could have been reclaimed. The only possible
explanation I can think of is that at the moment of the crash, memory
utilization suddenly skyrocketed in less than a second, so that the
2-second vmstat interval wouldn't show it? Nah.
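If the spike really were that fast, there is a way to catch it anyway (a
sketch, assuming Linux and that the backend pid has been looked up, e.g.
from pg_stat_activity): poll the backend's /proc status. VmPeak in
particular would survive a sub-second spike, because the kernel tracks the
process's high-water mark itself. The parsing part in plain Python:

```python
def vm_fields(status_text: str) -> dict:
    """Parse the Vm* lines out of a /proc/<pid>/status dump (values in kB)."""
    fields = {}
    for line in status_text.splitlines():
        if line.startswith("Vm"):
            name, value = line.split(":", 1)
            fields[name] = int(value.split()[0])  # strip the "kB" unit
    return fields

# Synthetic example of the relevant /proc/<pid>/status lines:
sample = """\
VmPeak:\t 6482016 kB
VmSize:\t 6481960 kB
VmRSS:\t 5473548 kB
VmSwap:\t       0 kB
"""
print(vm_fields(sample)["VmPeak"])  # 6482016

# In practice one would loop over open(f"/proc/{pid}/status") every few
# hundred milliseconds; VmPeak already holds the high-water mark, so even
# a spike that vmstat's 2-second totals miss is preserved there.
```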
I have already reduced work_mem considerably, which has helped in some
other cases before. Now I am going to reduce shared_buffers too, but that
seems counter-intuitive, because we are sitting on all that cache memory
unused!
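To put numbers on why reducing work_mem matters: each Sort or Materialize
node in the plan above (there are five) can claim up to work_mem on its
own, on top of shared_buffers. A back-of-the-envelope calculation (the
settings here are hypothetical round numbers, not my actual config):

```python
MB = 1024 * 1024

# Hypothetical settings -- substitute your own postgresql.conf values.
shared_buffers = 2048 * MB
work_mem       =  256 * MB
mem_nodes      =    5        # Sort/Materialize nodes in the plan above

# Worst case: every memory-hungry node fills its work_mem at once.
worst_case = shared_buffers + mem_nodes * work_mem
print(worst_case // MB)  # 3328 (MB) -- already a big bite out of 8 GB
```

And that is before counting the OS, other backends, and whatever the
executor itself holds per tuple.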
Might this be a bug? It feels like a bug. It feels like these out-of-memory
situations should be handled more gracefully (a garbage-collection
attempt?), and that there should be more information, so that a person can
actually do something about it.
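If I understand correctly, PostgreSQL does leave one clue: when a
backend's allocation fails, it dumps per-context memory statistics (lines
like "TopMemoryContext: ... used") to the server log right before the
"out of memory" ERROR. A sketch for pulling the biggest contexts out of
such a dump (the excerpt below is synthetic, only shaped like a real dump):

```python
import re

# Dump lines look like:
#   <Context>: <total> total in <n> blocks; <free> free (...); <used> used
CONTEXT_LINE = re.compile(r"^\s*(\S.*?): (\d+) total in .*?; (\d+) used")

def biggest_contexts(log_text: str, n: int = 3):
    """Return the n memory contexts with the most bytes used."""
    found = []
    for line in log_text.splitlines():
        m = CONTEXT_LINE.match(line)
        if m:
            found.append((int(m.group(3)), m.group(1)))
    return sorted(found, reverse=True)[:n]

# Synthetic excerpt in the shape of a real dump:
dump = """\
TopMemoryContext: 677696 total in 18 blocks; 18144 free (8 chunks); 659552 used
  ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used
  TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used
"""
for used, name in biggest_contexts(dump):
    print(name, used)
```

Run against the real log, that would at least say which context ate the
memory; but it still doesn't say which plan node, which is my complaint.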
Any ideas?
-Gunther