On 2013-10-14 09:36:03 -0400, Robert Haas wrote:
> > I thought and implemented that in the beginning. Unfortunately it's not
> > enough :(. That's probably the issue that took me longest to understand
> > in this patchseries...
> >
> > Combocids can only fix the case where a transaction actually has create
> > a combocid:
> >
> > 1) TX1: INSERT id = 1 at 0/1: (xmin = 1, xmax=Invalid, cmin = 55, cmax = 
> > Invalid)
> > 2) TX2: DELETE id = 1 at 0/1: (xmin = 1, xmax=2, cmin = Invalid, cmax = 1)
> >
> > So, if we're decoding data that needs to lookup those rows in TX1 or TX2
> > we both times need access to cmin and cmax, but neither transaction will
> > have created a multixact. That can only be an issue in transaction with
> > catalog modifications.
> 
> Oh, yuck.  So that means you have to write an extra WAL record for
> EVERY heap insert, update, or delete to a catalog table?  OUCH.

So. As it turns out that solution isn't sufficient in the face of VACUUM
FULL and mixed DML/DDL transaction that have not yet been decoded.

To reiterate, as published it works like:
For every modification of catalog tuple (insert, multi_insert, update,
delete) that has influence over visibility issue a record that contains:
* filenode
* ctid
* (cmin, cmax)

When doing a visibility check on a catalog row during decoding of mixed
DML/DDL transaction lookup (cmin, cmax) for that row since we don't
store both for the tuple.

That mostly works great.

The problematic scenario is decoding a transaction that has done mixed
DML/DDL *after* a VACUUM FULL/CLUSTER has been performed. The VACUUM
FULL obviously changes the filenode and the ctid of a tuple, so we
cannot successfully do a lookup based on what we logged before.

I know of the following solutions:
1) Don't allow VACUUM FULL on catalog tables if wal_level = logical.
2) Make VACUUM FULL prevent DDL and then wait till all changestreams
   have decoded up to the current point.
3) don't delete the old relfilenode for VACUUM/CLUSTERs of system tables
   if there are life decoding slots around, instead delegate that
   responsibility to the slot management.
4) Store both (cmin, cmax) for catalog tuples.

I bascially think only 1) and 4) are realistic. And 1) sucks.

I've developed a prototype for 4) and except currently being incredibly
ugly, it seems to be the most promising approach by far. My trick to
store both cmin and cmax is to store cmax in t_hoff managed space when
wal_level = logical.
That even works when changing wal_level from < logical to logical
because only ever need to store both cmin and cmax for transactions that
have decodeable content - which they cannot yet have before wal_level =
logical.

This requires some not so nice things:
* A way to declare we're storing both. I've currently chosen
  HEAP_MOVED_OFF | HEAP_MOVED_IN. That sucks.
* A way for heap_form_tuple to know it should add the necessary space to
  t_hoff. I've added TupleDesc->tdhaswidecid for it.
* Fiddling with existing checks for HEAP_MOVED{,OFF,IN} to check for
  both set at the same time.
* Changing the WAL logging to (optionally?) transport the current
  CommandId instead of always resetting it InvalidCommandId.

The benefits are:
* Working VACUUM FULL
* Much simpler tqual.c logic, everything is stored in the row itself. No
  hash or something like that built.
* No more need to log (relfilenode, cmin, cmax) separately from heap
  changes itself anymore.

In the end, the costs are that individual catalog rows are 4 bytes
bigger iff wal_level = logical. That seems acceptable.

Some questions remain:
* Better idea for a flag than HEAP_MOVED_OFF | HEAP_MOVED_IN
* Should we just unconditionally log the current CommandId or make it
  conditional. We have plenty of flag space to signal whether it's
  present, but it's just 4 bytes.

Comments?

Greetings,

Andres Freund

-- 
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to