Hi all!

We just had a case of a very long running process of ours that creates does a lot of prepared statements through Perls DBD:Pg running into:

https://rt.cpan.org/Public/Bug/Display.html?id=88827

This resulted in millions of prepared statements created, but not removed in the affected backends over the course of 1-2 hours until the backends in question ran out of memory. The "out of memory" condition resulted in one memory context dump generated per occurance each consisting of >2M lines(!) (basically a line of CachedPlan/CachePlanSource per statement/function). In the 20 Minutes or it took monitoring to alert and operations to react this cause a followup incident because repeated "out of memory" conditions caused over 400M(!!) loglines amounting to some 15GB of data running the log partition dangerously close to full.

an example memory context dump looks like this:


TopMemoryContext: 582728880 total in 71126 blocks; 6168 free (52 chunks); 582722712 used TopTransactionContext: 8192 total in 1 blocks; 6096 free (1 chunks); 2096 used
    ExecutorState: 8192 total in 1 blocks; 5392 free (0 chunks); 2800 used
      ExprContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
    SPI Exec: 0 total in 0 blocks; 0 free (0 chunks); 0 used
    SPI Proc: 8192 total in 1 blocks; 5416 free (0 chunks); 2776 used
PL/pgSQL function context: 8192 total in 1 blocks; 1152 free (1 chunks); 7040 used PL/pgSQL function context: 24576 total in 2 blocks; 11400 free (1 chunks); 13176 used Type information cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used PL/pgSQL function context: 8192 total in 1 blocks; 1120 free (1 chunks); 7072 used PL/pgSQL function context: 24576 total in 2 blocks; 10984 free (1 chunks); 13592 used PL/pgSQL function context: 57344 total in 3 blocks; 29928 free (2 chunks); 27416 used PL/pgSQL function context: 57344 total in 3 blocks; 28808 free (2 chunks); 28536 used PL/pgSQL function context: 24576 total in 2 blocks; 5944 free (3 chunks); 18632 used RI compare cache: 24576 total in 2 blocks; 15984 free (5 chunks); 8592 used RI query cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used PL/pgSQL function context: 57344 total in 3 blocks; 31832 free (2 chunks); 25512 used PL/pgSQL function context: 57344 total in 3 blocks; 29600 free (2 chunks); 27744 used PL/pgSQL function context: 57344 total in 3 blocks; 39688 free (5 chunks); 17656 used
  CFuncHash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
Rendezvous variable hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used PLpgSQL function cache: 24520 total in 2 blocks; 3744 free (0 chunks); 20776 used Prepared Queries: 125886512 total in 25 blocks; 4764208 free (91 chunks); 121122304 used
  TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
  MessageContext: 8192 total in 1 blocks; 6976 free (0 chunks); 1216 used
Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used smgr relation table: 24576 total in 2 blocks; 5696 free (4 chunks); 18880 used TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
    PortalHeapMemory: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
ExecutorState: 57344 total in 3 blocks; 21856 free (2 chunks); 35488 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 8192 total in 1 blocks; 8160 free (1 chunks); 32 used
Relcache by OID: 24576 total in 2 blocks; 12832 free (3 chunks); 11744 used CacheMemoryContext: 42236592 total in 28 blocks; 7160904 free (298 chunks); 35075688 used
    CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 7168 total in 3 blocks; 1544 free (0 chunks); 5624 used
CachedPlanSource: 7168 total in 3 blocks; 3904 free (1 chunks); 3264 used
    CachedPlan: 15360 total in 4 blocks; 6440 free (1 chunks); 8920 used
CachedPlanSource: 7168 total in 3 blocks; 1352 free (0 chunks); 5816 used
    CachedPlan: 7168 total in 3 blocks; 2248 free (0 chunks); 4920 used
    CachedPlanSource: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
    CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used
CachedPlanSource: 7168 total in 3 blocks; 1888 free (0 chunks); 5280 used
    CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used
CachedPlanSource: 7168 total in 3 blocks; 1888 free (0 chunks); 5280 used
    CachedPlan: 7168 total in 3 blocks; 2248 free (0 chunks); 4920 used
    CachedPlanSource: 3072 total in 2 blocks; 64 free (0 chunks); 3008 used
    CachedPlan: 15360 total in 4 blocks; 7576 free (0 chunks); 7784 used

... goes of for some 2M lines...

given that an "out of memory" situation is already bad enough, but essentially mounting a DoS on the rest of the system due to creating excessive amounts of logs feels even worse.

I wonder if we should have a default of capping the dump to say 1k lines or such and only optionally do a full one. the other option would be to find an alternative way of displaying the information like if there is a large number of repeated entries just show the first X of them and say "123456 similiar ones".

This is 9.1.14 on Debian Wheezy/amd64 fwiw - but I dont think we have made relevant changes in more recent versions.



regards


Stefan


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to