On Wed, May 01, 2019 at 07:12:52PM +0200, Alexander Kukushkin wrote:
Hi,
On Wed, 1 May 2019 at 17:02, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
OK, so that seems like a separate issue, somewhat unrelated to the issue I
reported. And I'm not sure it's a walsender issue - it seems it might be a
psycopg2 issue in not reporting the flush properly, no?
Agree, it is a different issue, but I am unsure what to blame,
postgres or psycopg2.
Right now in the psycopg2 we confirm more or less every XLogData
message, but at the same time LSN on the primary is moving forward and
we get updates with keepalive messages.
I perfectly understand the need to periodically send updates of flush
= walEnd (which comes from keepalive). It might happen that there is
no transaction activity but WAL is still generated and as a result
replication slot will prevent WAL from being cleaned up.
From the client side perspective, it confirmed everything that it
should, but from the postgres side, this is not enough to shut down
cleanly. Maybe it is possible to change the check (sentPtr ==
replicatedPtr) to something like (lastMsgSentPtr <= replicatedPtr) or
it would be unsafe?
I don't know.
In general I think it's a bit strange that we're waiting for walsender
processes to catch up even in fast shutdown mode, instead of just aborting
them like other backends. But I assume there are reasons for that. OTOH it
makes us vulnerable to issues like this, when a (presumably) misbehaving
downstream prevents a shutdown.
>ptr=0x55cb958dca08, len=94) at be-secure.c:318
>#2 0x000055cb93aa7b87 in secure_write (port=0x55cb958d71e0,
>ptr=0x55cb958dca08, len=94) at be-secure.c:265
>#3 0x000055cb93ab6bf9 in internal_flush () at pqcomm.c:1433
>#4 0x000055cb93ab6b89 in socket_flush () at pqcomm.c:1409
>#5 0x000055cb93dac30b in send_message_to_frontend
>(edata=0x55cb942b4380 <errordata>) at elog.c:3317
>#6 0x000055cb93da8973 in EmitErrorReport () at elog.c:1481
>#7 0x000055cb93da5abf in errfinish (dummy=0) at elog.c:481
>#8 0x000055cb93da852d in elog_finish (elevel=13, fmt=0x55cb93f32de3
>"sending replication keepalive") at elog.c:1376
>#9 0x000055cb93bcae71 in WalSndKeepalive (requestReply=true) at
>walsender.c:3358
Is it stuck in the send() call forever, or did you happen to grab
this bracktrace?
No, it didn't stuck there. During the shutdown postgres starts sending
a few thousand keepalive messages per second and receives back so many
feedback message, therefore the chances of interrupting somewhere in
the send are quite high.
Uh, that seems a bit broken, perhaps?
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services