On Tue, Aug 13, 2019 at 05:04:35PM +0900, Michael Paquier wrote:
> On Wed, Feb 13, 2019 at 01:51:47PM -0800, Andres Freund wrote:
>> I'm not yet sure that that's actually something that's supposed to
>> happen, I got to spend some time analysing how this actually
>> happens. Normally the contents of the slot should actually prevent it
>> from being removed (as they're newer than
>> ReplicationSlotsComputeLogicalRestartLSN()). I kind of wonder if that's
>> a bug in the drop logic in newer releases.
>
> In the same context, could it be a consequence of 9915de6c which has
> introduced a conditional variable to control slot operations?  This
> could have exposed more easily a pre-existing race condition.
> --

This is one of the remaining open items, and we don't seem to be moving
forward with it :-(

I'm willing to take a stab at it, but to do that I need a way to
reproduce it. Tom, you mentioned you've managed to reproduce it in a
qemu instance, but that it took some fiddling with qemu parameters or
something. Can you share what exactly was necessary?

An observation about the issue - while we started to notice this after
December, that's mostly because the PANIC patch went in shortly before.
However, we had seen the issue before, as Thomas Munro mentioned in [1].

Those reports are from August, so it's quite possible that something
committed during the first CF upset the code. And there's only a single
commit from the 2018-07 CF that seems related to logical decoding /
snapshots [2], i.e. f49a80c:

commit f49a80c481f74fa81407dce8e51dea6956cb64f8
Author: Alvaro Herrera <alvhe...@alvh.no-ip.org>
Date:   Tue Jun 26 16:38:34 2018 -0400

   Fix "base" snapshot handling in logical decoding

   ...

The other reason to suspect this is related is that the fix also made it
to REL_11_STABLE at that time, and if you check the buildfarm data [3],
you'll see 11 fails on nightjar too, from time to time.

This means it's not a 12+ only issue, it's a live issue on 11. I don't
know if f49a80c is the culprit, or if it simply uncovered a pre-existing
bug (e.g. due to timing).


[1] https://www.postgresql.org/message-id/CAEepm%3D0wB7vgztC5sg2nmJ-H3bnrBT5GQfhUzP%2BFfq-WT3g8VA%40mail.gmail.com

[2] https://commitfest.postgresql.org/18/1650/

[3] https://buildfarm.postgresql.org/cgi-bin/show_history.pl?nm=nightjar&br=REL_11_STABLE

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
