On 22 November 2016 at 10:20, Craig Ringer <cr...@2ndquadrant.com> wrote:

> I'm currently looking at making detection of replay conflict with a
> slot work by separating the current catalog_xmin into two effective
> parts - the catalog_xmin currently needed by any known slots
> (ProcArray->replication_slot_catalog_xmin, as now), and the oldest
> actually valid catalog_xmin where we know we haven't removed anything
> yet.

OK, more detailed plan.

The last checkpoint's oldestXid, and ShmemVariableCache's oldestXid,
are already held down by ProcArray's catalog_xmin. But that doesn't
mean we haven't removed newer tuples from specific relations and
logged that in xl_heap_clean, etc, including catalogs or user
catalogs, it only means the clog still exists for those XIDs. We don't
emit a WAL record when we advance oldestXid in
SetTransactionIdLimit(), and doing so is useless because vacuum will
have already removed needed tuples from needed catalogs before calling
SetTransactionIdLimit() from vac_truncate_clog(). We know that if
oldestXid is n, the true valid catalog_xmin where no needed tuples
have been removed must be >= n. But we need to know the lower bound of
valid catalog_xmin, which oldestXid doesn't give us.

So right now a standby has no way to reliably know if the catalog_xmin
requirement for a given replication slot can be satisfied. A standby
can't tell based on a xl_heap_cleanup_info record, xl_heap_clean
record, etc whether the affected table is a catalog or not, and
shouldn't generate conflicts for non-catalogs since otherwise it'll be
constantly clobbering walsenders.

A 2-phase advance of the global catalog_xmin would mean that
GetOldestXmin() would return a value from ShmemVariableCache, not the
oldest catalog xmin from ProcArray like it does now. (auto)vacuum
would then be responsible for:

* Reading the oldest catalog_xmin from procarray
* If it has advanced vs what's present in ShmemVariableCache, writing
a new xlog record type recording an advance of oldest catalog xmin
* advancing ShmemVariableCache's oldest catalog xmin

and would do so before it called GetOldestXmin via
vacuum_set_xid_limits() to determine what it can remove.

GetOldestXmin would return the ProcArray's copy of the oldest
catalog_xmin when in recovery, so we report it via hot_standby_fedback
to the upstream, it's recorded on our physical slot, and in turn
causes vacuum to advance the master's effective oldest catalog_xmin
for vacuum.

On the standby we'd replay the catalog_xmin advance record, advance
the standby's ShmemVariableCache's oldest catalog xmin, and check to
see if any replication slots, active or not, have a catalog_xmin <
than the new threshold. If none do, there's no conflict and we're
fine. If any do, we wait
max_standby_streaming_delay/max_standby_archive_delay as appropriate,
then generate recovery conflicts against all backends that have an
active replication slot based on the replication slot state in shmem.
Those backends - walsender or normal decoding backend - would promptly
die. New decoding sessions will check the ShmemVariableCache and
refuse to start if their catalog_xmin is < the threshold. Since we
advance it before generating recovery conflicts there's no race with
clients trying to reconnect after their backend is killed with a

If we wanted to get fancy we could set the latches of walsender
backends at risk of conflicting and they could check
ShmemVariableCache's oldest valid catalog xmin, so they could send
immediate keepalives with reply_requested set and hopefully get flush
confirmation from the client and advance their catalog_xmin before we
terminate them as conflicting with recovery. But that can IMO be done

Going to prototype this.

Alternate approach:

The oldest xid in heap_xlog_cleanup_info is relation-specific, but the
standby has no way to know if it's a catalog relation or not during
redo and know whether to kill slots and decoding sessions based on its
latestRemovedXid. Same for xl_heap_clean and the other records that
can cause snapshot conflicts (xl_xlog_visible, xl_heap_freeze_page,
xl_btree_delete xl_btree_reuse_page, spgxlogVacuumRedirect).

Instead of adding a 2-phase advance of the global catalog_xmin, we can
instead add a rider to each of these records that identifies whether
it's a catalog table or not. This would only be emitted when wal_level
>= logical, but it *would* increase WAL sizes a bit when logical
decoding is enabled even if it's not going to be used on a standby.
The rider would be a simple:

typedef struct xl_rel_catalog_info
    bool rel_accessible_from_logical_decoding;
} xl_catalog_info;

or similar. During redo we call a new
ResolveRecoveryConflictWithLogicalSlot function from each of those
records' redo routines that does what I outlined above.

This way add more info to more xlog records, and the upstream has to
use RelationIsAccessibleInLogicalDecoding() to set up the records when
writing the xlogs. In exchange, we don't have to add a new field to
CheckPoint or ShmemVariableCache or add a new xlog record type. It
seems the worse option to me.

(BTW, as comments on GetOldestSafeDecodingTransactionId() note, we
can't rely on KnownAssignedXidsGetOldestXmin() since it can be
incomplete at least on standby.)

 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:

Reply via email to