On Mon, Jun 25, 2012 at 3:17 PM, Andres Freund <and...@2ndquadrant.com> wrote:
>> I suppose the main reason we haven't done it already is that it
>> increases the period of time during which we're using 2X the disk
>> space.
> I find that an acceptable price if it's optional. Making it such doesn't seem
> to be a problem for me.

+1.

>> I think there is absolutely nothing wrong with doing extra things in
>> ALTER TABLE when logical replication is enabled.  We've got code
>> that's conditional on Hot Standby being enabled in many places in the
>> system; why should logical replication be any different?  If we set
>> the bar for logical replication at "the system can't do anything
>> differently when logical replication is enabled" then I cheerfully
>> submit that we are doomed.  You've already made WAL format changes to
>> support logging the pre-image of the tuple, which is a hundred times
>> more likely to cause a performance problem than any monkeying around
>> we might want to do in ALTER TABLE.
>>
>> I am deeply skeptical that we need to look inside of transactions that
>> do full-table rewrites.  But even if we do, I don't see that what I'm
>> proposing precludes it.  For example, I think we could have ALTER
>> TABLE emit WAL records specifically for logical replication that allow
>> us to disentangle which tuple descriptor to use at which point in the
>> transaction.  I don't see that that would even be very difficult to
>> set up.
> Sorry, I was imprecise above: I have no problem doing some changes during
> ALTER TABLE if logical rep is enabled. I am worried though that to make that
> robust you would need loads of places that emit additional information:
> * ALTER TABLE
> * ALTER FUNCTION
> * ALTER OPERATOR
> * ALTER/CREATE CAST
> * TRUNCATE
> * CLUSTER
> * ...
>
> I have the feeling that would we want to do that the full amount of required
> information would be rather high and end up being essentially the catalog.
> Just having an intermediate tupledesc doesn't help that much if you have e.g.
> record_out doing type lookups of its own.
>
> There also is the issue you have talked about before, that a user-type might
> depend on values in other tables. Unless we're ready to break at least
> transactional behaviour for those (for now...), I don't see how decoding
> outside of the transaction is ever going to be valid? I wouldn't have a big
> problem declaring that as broken for now...

I've been thinking about this a lot.  My thinking's still evolving
somewhat, but consider the following case.  A user defines a type T
with an input function I and an output function O.  They create a
table which uses type T and insert a bunch of data, which is duly
parsed using I; then, they replace I with a new input function I' and
O with a new output function O'.  Now, clearly, if we process the
inserts using the catalogs that were in effect at the time the inserts
were done, we could theoretically get different output than if we use
the catalogs that were in effect after the I/O functions were
replaced.  But is the latter output wrong, or merely different?  My
first thought when we started talking about this was "it's wrong", but
the more I think about it, the less convinced I am...

...because it can't possibly be right to suppose that it's impossible
to decode heap tuples using any catalog contents other than the ones
that were in effect at the time the tuples got inserted.  If that were
true, then we wouldn't be able to read a table after adding or
dropping a column, which of course we can.  It seems to me that it
might be sufficient to guarantee that we'll decode using the same
*types* that were in effect at the time the inserts happened.  If the
user yanks the rug out from under us by changing the type definition,
maybe we simply define that as a situation in which they get to keep
both pieces.  After all, if you replace the type definition in a way
that makes sensible decoding of the table impossible, you've pretty
much shot yourself in the foot whether logical replication enters the
picture or not.
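
To make the scenario concrete, here's a rough sketch in SQL (function
bodies elided, and the names t/i/o are just invented for illustration):

create type t;
create function i(cstring) returns t as ... ;  -- input function I
create function o(t) returns cstring as ... ;  -- output function O
create type t (input = i, output = o);
create table tab (x t);
insert into tab values ('whatever');           -- parsed using I
create or replace function o(t) returns cstring as ... ;  -- now O'
-- any later decoding of tab's existing rows goes through O', not O

CREATE OR REPLACE keeps the function's OID, so the type silently picks
up the new behavior for rows that were inserted under the old one.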

In the enum case, for example, we go to great pains to make sure that
the table contents are always decodable not only under the current
version of SnapshotNow, but also any successor version.  We do that by
prohibiting ALTER TYPE .. ADD VALUE from running inside a transaction
block - because if we inserted a row into pg_enum and then inserted
dependent rows into some user table, a rollback could leave us with
rows that we can't decode.  For the same reason, we don't allow ALTER
TYPE .. DROP VALUE.  I think that we can infer a general principle
from this: while I/O functions may refer to catalog contents, they may
not do so in a way that could be invalidated by subsequent commits or
rollbacks.  If they did, they'd be breaking the ability of subsequent
SELECT statements to read the table.
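
For instance (typing this from memory, so the exact error text may
differ slightly):

rhaas=# create type mood as enum ('sad', 'ok', 'happy');
CREATE TYPE
rhaas=# begin;
BEGIN
rhaas=# alter type mood add value 'meh';
ERROR:  ALTER TYPE ... ADD cannot run inside a transaction block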

An interesting case that is arguably an exception to this rule is that
of the regwhatever types, which will cheerfully output their value as an OID
if it can't be decoded to text, but will bail on input if the textual
input can't be mapped to an OID via the appropriate system catalog.
The results don't even obey transaction semantics:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN
rhaas=# select 't'::regclass;
 regclass
----------
 t
(1 row)

-- at this point, drop table t in another session
rhaas=# select 't'::regclass;
 regclass
----------
 t
(1 row)

But:

rhaas=# create table t (a int);
CREATE TABLE
rhaas=# begin;
BEGIN
-- at this point, drop table t in another session
rhaas=# select 't'::regclass;
ERROR:  relation "t" does not exist at character 8
STATEMENT:  select 't'::regclass;
ERROR:  relation "t" does not exist
LINE 1: select 't'::regclass;
               ^

So in this case, the I/O functions are dependent not only on the
contents of the system catalogs and what's visible under SnapshotNow,
but also on the contents of the backend-local system caches and the
exact timing of calls to AcceptInvalidationMessages() with respect to
concurrent activities in other sessions.  There is no way you're gonna
be able to figure that out from the WAL stream, which means that
there's no possible way to ensure that decode-to-text produces the
same value that the user actually typed (never mind that he could have
specified the input value as either an integer or a table name).  And
even if you could, the replication transaction could fail during
reencoding on the remote node.  To replicate this at all, you'd
probably have to decide on replicating the underlying OID and forget
about the text representation... but it wouldn't bother me very much
to say "oh, that's a kooky internal type, you can't use it with
logical replication".  Unless there's some way of making sure that all
the remote OID assignments are the same as the local ones, it's not
going to be that meaningful anyway.  Even if you had such a thing, the
fact that the behavior can depend on leftover syscache contents means
that such a type might break under parallel apply.

So to get back to my main point: how about decreeing that logical
rep's responsibilities extend only to using the correct set of type
OIDs to decode the data, placing the burden of not whacking around the
type definitions or any ancillary tables on the user?  That seems to
reduce the complexity of the problem scenarios quite a lot, and also
seems generally sane.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers