Re: [HACKERS] OOM in libpq and infinite loop with getCopyStart()

Michael Paquier Wed, 09 Mar 2016 23:22:55 -0800

On Thu, Mar 10, 2016 at 12:12 AM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Aleksander Alekseev wrote:
>> ```
>> pg_receivexlog: could not send replication command "START_REPLICATION": out of memory
>> pg_receivexlog: disconnected; waiting 5 seconds to try again
>> pg_receivexlog: starting log streaming at 0/1000000 (timeline 1)
>>
>> Breakpoint 1, getCopyStart (conn=0x610180, copytype=PGRES_COPY_BOTH, msgLength=3) at fe-protocol3.c:1398
>> 1398              const char *errmsg = NULL;
>> ```
>>
>> Granted this behaviour is a bit better than the current one. But
>> basically it's the same infinite loop, only with pauses and warnings. I
>> wonder if this is a behaviour we really want. For instance, wouldn't it
>> be better just to terminate the application in an out-of-memory case? "Let
>> it crash", as Erlang programmers say.
>
> Hmm.  It would be useful to retry in the case that there is a chance
> that the program releases memory and can continue later.  But if it will
> only stay there doing nothing other than retrying, then that obviously
> will not happen.  One situation where this might help is if the overall
> *system* is short on memory and we expect that situation to resolve
> itself after a while -- after all, if the system is so loaded that it
> can't allocate a few more bytes for the COPY message, then odds are that
> other things are also crashing and eventually enough memory will be
> released that pg_receivexlog can continue.


Yep, that's my assumption as well: at some point the system may
succeed, and I don't think that we should break the current behavior
of pg_receivexlog and pg_recvlogical in the back-branches. Now, note
that without the patch we actually have the same problem. If OOMs
happen continuously in getCopyStart with COPY_BOTH, libpq attempts to
read the next message over and over and keeps failing. The difference
is that the caller has no idea what is happening, because the loop
runs inside libpq itself.

> On the other hand, if the system is so loaded, perhaps it's better to
> "let it crash" and have it restart later -- presumably once the admin
> notices the problem and restarts it manually after cleaning up the mess.
>
> If all programs are well behaved and nothing crashes when OOM but they
> all retry instead, then everything will continue to retry infinitely and
> make no progress.  That cannot be good.

I think that's something we could take care of in those client
utilities with a new option like --maximum-retries or similar, but
that's a different discussion. The patch I am proposing here allows a
client application to be made aware of the OOM errors that happen. If
we don't do that first, something like --maximum-retries would be
useless for COPY_BOTH, as the client would never be made aware of the
OOM that happened in libpq and would keep looping inside libpq itself
until some memory is freed.
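
To illustrate why the error has to reach the caller first, here is a
minimal sketch of a capped retry loop on the client side. It is only a
toy model, not libpq or pg_receivexlog code: start_status,
try_start_stream() and stream_with_retries() are hypothetical names,
the OOM is simulated with a counter, and max_retries stands in for the
proposed (not existing) --maximum-retries option.

```c
#include <stdio.h>

/* Hypothetical status codes for this sketch; not part of libpq. */
typedef enum { START_OK, START_OOM } start_status;

/*
 * Hypothetical stand-in for starting a COPY_BOTH stream. It reports an
 * out-of-memory failure for the first *fail_count calls, then succeeds.
 * The point is that the failure is returned to the caller instead of
 * being retried internally.
 */
static start_status try_start_stream(int *fail_count)
{
    if (*fail_count > 0) {
        (*fail_count)--;
        return START_OOM;
    }
    return START_OK;
}

/*
 * Make one initial attempt plus at most max_retries retries.
 * Returns the number of attempts used on success, or -1 when the
 * retry budget is exhausted.
 */
static int stream_with_retries(int max_retries, int *fail_count)
{
    for (int attempt = 1; attempt <= max_retries + 1; attempt++) {
        if (try_start_stream(fail_count) == START_OK)
            return attempt;
        fprintf(stderr, "attempt %d: out of memory\n", attempt);
        /* A real client would wait here before trying again. */
    }
    return -1;
}
```

With a budget of 3 retries and two simulated failures, the call
succeeds on the third attempt and returns 3; with ten simulated
failures it gives up after four attempts and returns -1. A cap like
this only works because try_start_stream() hands the failure back to
its caller.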
-- 
Michael


