Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

Robert Haas Thu, 14 Jun 2012 09:18:15 -0700

On Wed, Jun 13, 2012 at 7:27 AM, Andres Freund <[email protected]> wrote:
> === Design goals for logical replication === :
> - in core
> - fast
> - async
> - robust
> - multi-master
> - modular
> - as unintrusive as possible implementation wise
> - basis for other technologies (sharding, replication into other DBMSs, ...)


I agree with all of these goals except for "multi-master".  I am not
sure that there is a need to have a multi-master replication solution
in core.  The big tricky part of multi-master replication is conflict
resolution, and that seems like an awful lot of logic to try to build
into core - especially given that we will want it to be extensible.

More generally, I would much rather see us focus on efficiently
extracting changesets from WAL and efficiently applying those
changesets on another server.  IMHO, those are the things that are
holding back the not-in-core replication solutions we have today,
particularly the first.  If we come up with a better way of applying
change-sets, well, Slony can implement that too; they are already
installing C code.  What neither they nor any other non-core solution
can implement is change-set extraction, and therefore I think that
ought to be the focus.

To put all that another way, I think it is a 100% bad idea to try to
kill off Slony, Bucardo, Londiste, or any of the many home-grown
solutions that are out there to do replication.  Even if there were no
technical place for third-party replication products (and I think
there is), we will not win many friends by making it harder to extend
and add on to the server.  If we build an in-core replication solution
that can be used all by itself, that is fine with me.  But I think it
should also be able to expose its constituent parts as a toolkit for
third-party solutions.

> While you may argue that most of the above design goals are already provided 
> by
> various trigger based replication solutions like Londiste or Slony, we think
> that thats not enough for various reasons:
>
> - not in core (and thus less trustworthy)
> - duplication of writes due to an additional log
> - performance in general (check the end of the above presentation)
> - complex to use because there is no native administration interface

I think that your parenthetical note "(and thus less trustworthy)"
gets at another very important point, which is that one of the
standards for inclusion in core is that it must in fact be trustworthy
enough to justify the confidence that users will place in it.  It will
NOT benefit the project to have two replication solutions in core, one
of which is crappy.  More, even if what we put in core is AS GOOD as
the best third-party solutions that are available, I don't think
that's adequate.  It has to be better.  If it isn't, there is no
excuse for preempting what's already out there.

I imagine you are thinking along similar lines, but I think that it
bears being explicit about.

> As we need a change stream that contains all required changes in the correct
> order, the requirement for this stream to reflect changes across multiple
> concurrent backends raises concurrency and scalability issues. Reusing the
> WAL stream for this seems a good choice since it is needed anyway and adresses
> those issues already, and it further means that we don't incur duplicate
> writes. Any other stream generating componenent would introduce additional
> scalability issues.

Agreed.

> We need a change stream that contains all required changes in the correct 
> order
> which thus needs to be synchronized across concurrent backends which 
> introduces
> obvious concurrency/scalability issues.
> Reusing the WAL stream for this seems a good choice since it is needed anyway
> and adresses those issues already, and it further means we don't duplicate the
> writes and locks already performance for its maintenance.

Agreed.

> Unfortunately, in this case, the WAL is mostly a physical representation of 
> the
> changes and thus does not, by itself, contain the necessary information in a
> convenient format to create logical changesets.

Agreed.

> The biggest problem is, that interpreting tuples in the WAL stream requires an
> up-to-date system catalog and needs to be done in a compatible backend and
> architecture. The requirement of an up-to-date catalog could be solved by
> adding more data to the WAL stream but it seems to be likely that that would
> require relatively intrusive & complex changes. Instead we chose to require a
> synchronized catalog at the decoding site. That adds some complexity to use
> cases like replicating into a different database or cross-version
> replication. For those it is relatively straight-forward to develop a proxy pg
> instance that only contains the catalog and does the transformation to textual
> changes.

The actual requirement here is more complex than "an up-to-date
catalog".  Suppose transaction X begins, adds a column to a table,
inserts a row, and commits.  That tuple needs to be interpreted using
the tuple descriptor that transaction X would see (which includes the
new column), NOT the tuple descriptor that some other transaction
would see at the same time (which won't include the new column).  In a
more complicated scenario, X might (1) begin, (2) start a
subtransaction that alters the table, (3) release the savepoint or
roll back to the save point, (4) insert a tuple, and (5) commit.  Now,
the correct tuple descriptor for interpreting the tuple inserted in
step (4) depends on whether step (3) was a release savepoint or a
rollback-to-savepoint.  How are you handling these (and similar but
more complex) cases?

Moreover, we will want in the future to allow some of the DDL changes
that currently require AccessExclusiveLock to be performed with a
lesser lock.  It is unclear to me that this will be practical as far
as adding columns goes, but it would be a shame if logical replication
were the thing standing in the way.  In that scenario, you might have:
transaction X begins a transaction, and adds a column; transaction Y
inserts a tuple which must be interpreted using the old tuple
descriptor (or maybe it's harmless to use the new one, since the extra
column will be interpreted as NULL anyway); transaction X inserts a
tuple (which MUST be interpreted using the new tuple descriptor); and
then X either commits or rolls back.  This might be too hypothetical
to worry about in detail, but it would at least be nice to have the
sense that we're not totally screwed if the locking rules for DDL
change someday.

> This also is the solution to the other big problem, the need to work around
> architecture/version specific binary formats. The alternative, producing
> cross-version, cross-architecture compatible binary changes or even moreso
> textual changes all the time seems to be prohibitively expensive. Both from a
> cpu and a storage POV and also from the point of implementation effort.

I think that if you can't produce a textual record of changes, you're
throwing away 75% of what people will want to do with this.  Being
able to replicate across architectures, versions, and even into
heterogeneous databases is the main point of having logical
replication, IMV.   Multi-master replication is nice to have, but IME
there is huge demand for a replication solution that doesn't need to
be temporarily replaced with something completely different every time
you want to do a database upgrade.

> The catalog on the site where changes originate can *not* be used for the
> decoding because at the time we decode the WAL the catalog may have changed
> from the state it was in when the WAL was generated. A possible solution for
> this would be to have a fully versioned catalog but that again seems to be
> rather complex and intrusive.

Yes, that seems like a non-starter.

> For some operations (UPDATE, DELETE) and corner-cases (e.g. full page writes)
> additional data needs to be logged, but the additional amount of data isn't
> that big. Requiring a primary-key for any change but INSERT seems to be a
> sensible thing for now. The required changes are fully contained in heapam.c
> and are pretty simple so far.

I think that you can handle the case where there are no primary keys
by simply treating the whole record as a primary key.  There might be
duplicates, but I think you can work around that by decreeing that
when you replay an UPDATE or DELETE operation you will update or
delete at most one record.  So if there are multiple exactly-identical
records in the target database, then you will just UPDATE or DELETE
exactly one of them (it doesn't matter which).  This doesn't even seem
particularly complicated.

> For transport of the non-decoded data from the originating site to the 
> decoding
> site we decided to reuse the infrastructure already provided by
> walsender/walreceiver.

I think this is reasonable.  Actually, I think we might want to
generalize this to a bunch of other stuff, too.  There could be many
reasons to want a distributed cluster of PostgreSQL servers with some
amount of interconnect.  We might want to think about renaming the
processes at some point, but that's probably putting the cart before
the horse just right now.

> We introduced a new command that, analogous to
> START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream out 
> all
> xlog records that pass through a filter.
>
> The on-the-wire format stays the same. The filter currently simply filters out
> all record which are not interesting for logical replication (indexes,
> freezing, ...) and records that did not originate on the same system.
>
> The requirement of filtering by 'origin' of a wal node comes from the planned
> multimaster support. Changes replayed locally that originate from another site
> should not replayed again there. If the wal is plainly used without such a
> filter that would cause loops. Instead we tag every wal record with the "node
> id" of the site that caused the change to happen and changes with a nodes own
> "node id" won't get applied again.
>
> Currently filtered records get simply replaced by NOOP records and loads of
> zeroes which obviously is not a sensible solution. The difficulty of actually
> removing the records is that that would change the LSNs. We currently rely on
> those though.
>
> The filtering might very well get expanded to support partial replication and
> such in future.

This all seems to need a lot more thought.

> To sensibly apply changes out of the WAL stream we need to solve two things:
> Reassemble transactions and apply them to the target database.
>
> The logical stream from 1. via 2. consists out of individual changes 
> identified
> by the relfilenode of the table and the xid of the transaction. Given
> (sub)transactions, rollbacks, crash recovery, subtransactions and the like
> those changes obviously cannot be individually applied without fully loosing
> the pretence of consistency. To solve that we introduced a module, dubbed
> ApplyCache which does the reassembling. This module is *independent* of the
> data source and of the method of applying changes so it can be reused for
> replicating into a foreign system or similar.
>
> Due to the overhead of planner/executor/toast reassembly/type conversion (yes,
> we benchmarked!) we decided against statement generation for apply. Even when
> using prepared statements the overhead is rather noticeable.
>
> Instead we decided to use relatively lowlevel heapam.h/genam.h accesses to do
> the apply. For now we decided to use only one process to do the applying,
> parallelizing that seems to be too complex for an introduction of an already
> complex feature.
> In our tests the apply process could keep up with pgbench -c/j 20+ generating
> changes. This will obviously heavily depend on the workload. A fully seek 
> bound
> workload will definitely not scale that well.
>
> Just to reiterate: Plugging in another method to do the apply should be a
> relatively simple matter of setting up three callbacks to a different function
> (begin, apply_change, commit).

I think it's reasonable to do the apply using these low-level
mechanisms, but surely people will sometimes want to extract tuples as
text and do whatever with them.  This goes back to my comment about
feeling that we need a toolkit approach, not a one-size-fits-all
solution.

> Another complexity in this is how to synchronize the catalogs. We plan to use
> command/event triggers and the oid preserving features from pg_upgrade to keep
> the catalogs in-sync. We did not start working on that.

This strikes me as completely unacceptable.  People ARE going to want
to replicate data between non-identical schemas on systems with
unsynchronized OIDs.  And even if they weren't, relying on triggers to
keep things in sync is exactly the sort of kludge that has inspired
all sorts of frustration with our existing replication solutions.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [RFC][PATCH] Logical Replication/BDR prototype and architecture

Reply via email to