For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system, whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix this error. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually address an out of memory condition if the system doesn't tell me where it is happening?

You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though. But just so you can get a glimpse here is the plan:

Insert on businessoperation  (cost=5358849.28..5361878.44 rows=34619 width=1197)
  ->  Unique  (cost=5358849.28..5361532.25 rows=34619 width=1197)
        ->  Sort  (cost=5358849.28..5358935.83 rows=34619 width=1197)
              Sort Key: documentinformationsubject.documentinternalid, 
documentinformationsubject.is_current, documentinformationsubject.documentid, 
documentinformationsubject.documenttypecode, 
documentinformationsubject.subjectroleinternalid, 
documentinformationsubject.subjectentityinternalid, 
documentinformationsubject.subjectentityid, 
documentinformationsubject.subjectentityidroot, 
documentinformationsubject.subjectentityname, 
documentinformationsubject.subjectentitytel, 
documentinformationsubject.subjectentityemail, 
documentinformationsubject.otherentityinternalid, 
documentinformationsubject.confidentialitycode, 
documentinformationsubject.actinternalid, documentinformationsubject.code_code, 
documentinformationsubject.code_displayname, q.code_code, q.code_displayname, 
an.extension, an.root, documentinformationsubject_2.subjectentitycode, 
documentinformationsubject_2.subjectentitycodesystem, 
documentinformationsubject_2.effectivetime_low, 
documentinformationsubject_2.effectivetime_high, 
documentinformationsubject_2.statuscode, 
documentinformationsubject_2.code_code, agencyid.extension, 
agencyname.trivialname, documentinformationsubject_1.subjectentitycode, 
documentinformationsubject_1.subjectentityinternalid
              ->  Nested Loop Left Join  (cost=2998335.54..5338133.63 
rows=34619 width=1197)
                    Join Filter: 
(((documentinformationsubject.documentinternalid)::text = 
(q.documentinternalid)::text) AND 
((documentinformationsubject.actinternalid)::text = (r.targetinternalid)::text))
                    ->  Merge Left Join  (cost=2998334.98..3011313.54 
rows=34619 width=930)
                          Merge Cond: 
(((documentinformationsubject.documentinternalid)::text = 
(documentinformationsubject_1.documentinternalid)::text) AND 
((documentinformationsubject.documentid)::text = 
(documentinformationsubject_1.documentid)::text) AND 
((documentinformationsubject.actinternalid)::text = 
(documentinformationsubject_1.actinternalid)::text))
                          ->  Sort  (cost=1408783.87..1408870.41 rows=34619 
width=882)
                                Sort Key: 
documentinformationsubject.documentinternalid, 
documentinformationsubject.documentid, documentinformationsubject.actinternalid
                                ->  Seq Scan on documentinformationsubject  
(cost=0.00..1392681.22 rows=34619 width=882)
                                      Filter: (((participationtypecode)::text = 
ANY ('{PPRF,PRF}'::text[])) AND ((classcode)::text = 'ACT'::text) AND 
((moodcode)::text = 'DEF'::text) AND ((code_codesystem)::text = 
'2.16.840.1.113883.3.26.1.1'::text))
                          ->  Materialize  (cost=1589551.12..1594604.04 
rows=1010585 width=159)
                                ->  Sort  (cost=1589551.12..1592077.58 
rows=1010585 width=159)
                                      Sort Key: 
documentinformationsubject_1.documentinternalid, 
documentinformationsubject_1.documentid, 
documentinformationsubject_1.actinternalid
                                      ->  Seq Scan on 
documentinformationsubject documentinformationsubject_1  (cost=0.00..1329868.64 
rows=1010585 width=159)
                                            Filter: 
((participationtypecode)::text = 'PRD'::text)
                    ->  Materialize  (cost=0.56..2318944.31 rows=13 width=341)
                          ->  Nested Loop Left Join  (cost=0.56..2318944.24 
rows=13 width=341)
                                ->  Nested Loop Left Join  
(cost=0.00..2318893.27 rows=1 width=281)
                                      Join Filter: 
((agencyname.entityinternalid)::text = 
(documentinformationsubject_2.otherentityinternalid)::text)
                                      ->  Nested Loop Left Join  
(cost=0.00..2286828.33 rows=1 width=291)
                                            Join Filter: 
((agencyid.entityinternalid)::text = 
(documentinformationsubject_2.otherentityinternalid)::text)
                                            ->  Nested Loop Left Join  
(cost=0.00..2284826.24 rows=1 width=239)
                                                  Join Filter: 
(((q.documentinternalid)::text = 
(documentinformationsubject_2.documentinternalid)::text) AND 
((q.actinternalid)::text = (documentinformationsubject_2.actinternalid)::text))
                                                  ->  Nested Loop  
(cost=0.00..954957.59 rows=1 width=136)
                                                        Join Filter: 
((q.actinternalid)::text = (r.sourceinternalid)::text)
                                                        ->  Seq Scan on 
actrelationship r  (cost=0.00..456015.26 rows=1 width=74)
                                                              Filter: 
((typecode)::text = 'SUBJ'::text)
                                                        ->  Seq Scan on 
documentinformation q  (cost=0.00..497440.84 rows=120119 width=99)
                                                              Filter: 
(((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND 
((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text))
                                                  ->  Seq Scan on 
documentinformationsubject documentinformationsubject_2  (cost=0.00..1329868.64 
rows=1 width=177)
                                                        Filter: 
((participationtypecode)::text = 'AUT'::text)
                                            ->  Seq Scan on entity_id agencyid  
(cost=0.00..1574.82 rows=34182 width=89)
                                      ->  Seq Scan on bestname agencyname  
(cost=0.00..27066.08 rows=399908 width=64)
                                ->  Index Scan using act_id_fkidx on act_id an  
(cost=0.56..50.85 rows=13 width=134)
                                      Index Cond: ((q.actinternalid)::text = 
(actinternalid)::text)

I have monitored the activity with vmstat and iostat, and it looks like the memory grabbing happens rapidly after a Sort Merge step. I see in the iostat a heavy read and write activity, which I attribute to a sort-merge step, then that is followed by a sudden spike in write activity, and then the out of memory crash.

procs -----------------------memory---------------------- ---swap-- -----io---- 
-system-- --------cpu-------- -----timestamp-----
 r  b         swpd         free         buff        cache   si   so    bi    bo 
  in   cs  us  sy  id  wa  st                 UTC
 0  2            0       119344            0      7616672    0    0 11681  3107 
   9    0   6   1  72  21   0 2019-04-14 16:19:52
 0  2            0       128884            0      7607288    0    0  2712 55386 
 500  509   3   2  15  80   0 2019-04-14 16:19:54
 0  2            0       116984            0      7619916    0    0   880 59241 
 548  525   2   2   9  87   0 2019-04-14 16:19:56
 0  2            0       131492            0      7604816    0    0   128 56512 
 518  401   1   1  12  86   0 2019-04-14 16:19:58
 ...
 0  2            0       134508            0      7601480    0    0     0 58562 
 428  353   0   1   4  95   0 2019-04-14 16:21:46
 0  2            0       125360            0      7611320    0    0     0 59392 
 481  369   0   1  11  89   0 2019-04-14 16:21:48
 0  2            0       122896            0      7612872    0    0     0 58564 
 431  342   0   1  17  82   0 2019-04-14 16:21:50
 1  1            0       121456            0      7614248    0    0    54 57347 
 487  399   0   1  13  85   0 2019-04-14 16:21:52
 0  2            0       122820            0      7613324    0    0    12 59964 
 460  346   0   1  20  79   0 2019-04-14 16:21:54
 0  2            0       120344            0      7616528    0    0  1844 55691 
 645  676   5   3   6  85   0 2019-04-14 16:21:56
 0  2            0       124900            0      7611404    0    0   936 58261 
 795 1215   2   3  13  83   0 2019-04-14 16:21:58
 0  2            0       124572            0      7612192    0    0  1096 55340 
 518  487   1   2   0  97   0 2019-04-14 16:22:00
 0  2            0       123040            0      7612740    0    0   888 57574 
 573  620   1   2   5  92   0 2019-04-14 16:22:02
 0  2            0       125112            0      7610592    0    0   124 59164 
 498  480   1   1  13  85   0 2019-04-14 16:22:04
 1  1            0       129440            0      7607592    0    0   568 60196 
 563  612   2   2   8  88   0 2019-04-14 16:22:06
 0  2            0       124020            0      7612364    0    0     0 58260 
 629  725   0   1   8  91   0 2019-04-14 16:22:08
 2  1            0       124480            0      7611848    0    0     0 58852 
 447  331   0   1   1  98   0 2019-04-14 16:22:10
 0  3            0       137636            0      7598484    0    0 11908 44995 
 619  714   1   1  11  87   0 2019-04-14 16:22:12
 0  2            0       123128            0      7613392    0    0 29888 28901 
 532  972   1   1  29  68   0 2019-04-14 16:22:14
 0  2            0       126260            0      7609984    0    0 39872 18836 
 706 1435   1   2  28  70   0 2019-04-14 16:22:16
 0  2            0       130748            0      7605536    0    0 36096 22488 
 658 1272   2   1   8  89   0 2019-04-14 16:22:18
...
 0  2            0       127216            0      7609192    0    0 29192 29696 
 472  949   1   1  23  75   0 2019-04-14 16:22:40
 0  2            0       147428            0      7588556    0    0 29120 29696 
 523  974   1   1  19  79   0 2019-04-14 16:22:42
 0  1            0       120644            0      7615388    0    0 32320 25276 
 566  998   1   2  49  47   0 2019-04-14 16:22:44
 0  1            0       128456            0      7607904    0    0 58624     0 
 621 1103   3   2  49  46   0 2019-04-14 16:22:46
 0  1            0       127836            0      7608260    0    0 58624     0 
 631 1119   3   2  50  46   0 2019-04-14 16:22:48
 0  1            0       126712            0      7609616    0    0 58624     0 
 616 1110   2   2  50  47   0 2019-04-14 16:22:50
...
 0  1            0       157408            0      7578060    0    0 58628     0 
 736 1206   3   3  50  44   0 2019-04-14 16:27:22
 0  1            0       142420            0      7593400    0    0 58688     0 
 623 1099   1   4  50  45   0 2019-04-14 16:27:24
 0  1            0       247016            0      7488184    0    0 58568     0 
 649 1113   1   4  50  45   0 2019-04-14 16:27:26
 0  1            0       123232            0      7612088    0    0 58412   215 
 675 1141   2   3  50  46   0 2019-04-14 16:27:28
 0  2            0       144920            0      7586576    0    0 48376 11046 
 788 1455   1   5  34  60   0 2019-04-14 16:27:30
 1  1            0       125636            0      7595704    0    0 36736 21381 
 702 1386   1   4  21  74   0 2019-04-14 16:27:32
 0  3            0       156700            0      7559328    0    0 35556 23364 
 709 1367   1   3  22  74   0 2019-04-14 16:27:34
 0  2            0       315580            0      7382748    0    0 33608 24731 
 787 1407   1   5  18  76   0 2019-04-14 16:27:36
...
 0  2            0       684412            0      6152040    0    0 29832 28356 
 528  994   1   2  32  66   0 2019-04-14 16:38:04
 0  2            0       563512            0      6272264    0    0 29696 29506 
 546  987   1   2  32  65   0 2019-04-14 16:38:06
 0  2            0       595488            0      6241068    0    0 27292 30858 
 549  971   1   2  26  71   0 2019-04-14 16:38:08
 0  2            0       550120            0      6285352    0    0 28844 29696 
 567  995   1   2  29  68   0 2019-04-14 16:38:10
 1  1            0       432380            0      6402964    0    0 28992 29696 
 557  979   1   2  37  61   0 2019-04-14 16:38:12
 0  2            0       445796            0      6384412    0    0 26768 32134 
 628 1029   1   4  27  69   0 2019-04-14 16:38:14
 0  2            0       374972            0      6453592    0    0 28172 30839 
 529  962   1   2  43  54   0 2019-04-14 16:38:16
 0  2            0       317824            0      6507992    0    0 29172 29386 
 560 1001   1   3  27  68   0 2019-04-14 16:38:18
 0  3            0       215092            0      6609132    0    0 33116 25210 
 621 1148   1   3  19  77   0 2019-04-14 16:38:20
 0  2            0       194836            0      6621524    0    0 27786 30959 
 704 1152   0   5  18  77   0 2019-04-14 16:38:22
 0  3            0       315648            0      6500196    0    0 31434 27226 
 581 1073   0   3  31  65   0 2019-04-14 16:38:24
*0 2 0 256180 0 6554676 0 0 29828 29017 668 1174 0 4 20 76 0 2019-04-14 16:38:26* <<< CRASH
 0  1            0       378220            0      6552496    0    0  4348 53686 
2210 3816   1   5  46  49   0 2019-04-14 16:38:28
 0  1            0       389888            0      6536296    0    0  2704 56529 
2454 4178   0   5  42  52   0 2019-04-14 16:38:30
 0  2            0       923572            0      5998992    0    0  1612 56863 
2384 3928   0   6  16  78   0 2019-04-14 16:38:32
 0  0            0       908336            0      6006696    0    0  3584 49280 
8961 17334   0  19  39  42   0 2019-04-14 16:38:34
 0  1            0      1306480            0      5607088    0    0   264 63632 
18605 37933   3  58  35   4   0 2019-04-14 16:38:36
 2  1            0      1355448            0      5558576    0    0     8 59222 
14817 30296   2  46  24  27   0 2019-04-14 16:38:38
 2  2            0      1358224            0      5555884    0    0     0 58544 
14226 28331   2  44   3  50   0 2019-04-14 16:38:40
 2  1            0      1446348            0      5468748    0    0     0 58846 
14376 29185   2  44  11  42   0 2019-04-14 16:38:42
 0  0            0      2639648            0      4357608    0    0     0 28486 
12909 26770   2  44  49   5   0 2019-04-14 16:38:44
 0  0            0      2639524            0      4357800    0    0     0     0 
 158  154   0   0 100   0   0 2019-04-14 16:38:46
 0  0            0      2687316            0      4309976    0    0     0     0 
 181  188   0   2  98   0   0 2019-04-14 16:38:48
 0  0            0      2706920            0      4300116    0    0     0   105 
 137  263   0   0 100   0   0 2019-04-14 16:38:50
 0  0            0      2706672            0      4300232    0    0     0     0 
 142  204   0   0 100   0   0 2019-04-14 16:38:52
 0  0            0      2815116            0      4191928    0    0     0     0 
 116  242   0   0 100   0   0 2019-04-14 16:38:54
 0  0            0      2815364            0      4192008    0    0     0     0 
 116  239   0   0 100   0   0 2019-04-14 16:38:56
 0  0            0      2815116            0      4192164    0    0     0     0 
 159  236   0   0 100   0   0 2019-04-14 16:38:58

ending after the out of memory crash, that occurred exactly at the marked point 16:38:26.355 UTC.

We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been mobilized. The only possible explanation I can think of is that in that moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2 second vmstat interval wouldn't show it??? Nah.

I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce the shared_buffers now, but that seems counter-intuitive because we are sitting on all that cache memory unused!

Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more gracefully (garbage collection attempt?) and that somehow there should be more information so the person can do anything about it.

Any ideas?

-Gunther

Reply via email to