Re: [HACKERS] logical replication - still unstable after all these months

Mark Kirkwood Thu, 01 Jun 2017 15:47:07 -0700

On 31/05/17 21:16, Petr Jelinek wrote:

On 29/05/17 23:06, Mark Kirkwood wrote:

On 29/05/17 23:14, Petr Jelinek wrote:

On 29/05/17 03:33, Jeff Janes wrote:

What would you want to look at?  Would saving the WAL from the master be
helpful?

Useful info is, logs from provider (mainly the logical decoding logs
that mention LSNs), logs from subscriber (the lines about when sync
workers finished), contents of the pg_subscription_rel (with srrelid
casted to regclass so we know which table is which), and pg_waldump
output around the LSNs found in the logs and in the pg_subscription_rel
(+ few lines before and some after to get context). It's enough to only
care about LSNs for the table(s) that are out of sync.

I have a run that aborted with failure (accounts table md5 mismatch).
Petr - would you like to have access to the machine ? If so send me you
public key and I'll set it up.

Thanks to Mark's offer I was able to study the issue as it happened and
found the cause of this.

The busy loop in apply stops at the point when worker shmem state
indicates that table synchronization was finished, but that might not be
visible in the next transaction if it takes long to flush the final
commit to disk so we might ignore couple of transactions for given table
in the main apply because we think it's still being synchronized. This
also explains why I could not reproduce it on my testing machine (fast
ssd disk array where flushes were always fast) and why it happens
relatively rarely because it's one specific commit during the whole
synchronization process that needs to be slow.

So as solution I changed the busy loop in the apply to wait for
in-catalog status rather than in-memory status to make sure things are
really there and flushed.

While working on this I realized that the handover itself is bit more
complex than necessary (especially for debugging and for other people
understanding it) so I did some small changes as part of this patch to
make the sequences of states table goes during synchronization process
to always be the same. This might cause unnecessary update per one table
synchronization process in some cases but that seems like small enough
price to pay for clearer logic. And it also fixes another potential bug
that I identified where we might write wrong state to catalog if main
apply crashed while sync worker was waiting for status update.

I've been running tests on this overnight on another machine where I was
able to reproduce  the original issue within few runs (once I found what
causes it) and so far looks good.

I'm seeing a new failure with the patch applied - this time the historytable has missing rows. Petr, I'll put back your access :-)


regards

Mark



--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] logical replication - still unstable after all these months

Reply via email to