Hi everyone, This mail contains the highlevel design description of how our prototype of in-core logical replication works. The individual patches will be posted as replies to this email. I obviously welcome all sorts of productive comments to both the individual patches and the architecture.
Unless somebody objects I will add most of the individual marked as RFC to the current commitfest. I hope that with comments stemming from that round we can get several of the patches into the first or second commitfest. As soon as the design is clear/accepted we will try very hard to get the following patches into the second or third round. If anybody disaggrees with the procedual way we try do this, please raise a hand now. I tried to find the right balance between keeping the description short enough that anybody will read the design docs and verbose enough that it is understandable. I can go into much more detail in any part if wanted. Please, keep in mind that those patches are *RFC* and a prototype and not intended to be applied as-is. There is a short description of the individual patches and their relevancy at the end of the email. Greetings, Andres ======== === Design goals for logical replication === : - in core - fast - async - robust - multi-master - modular - as unintrusive as possible implementation wise - basis for other technologies (sharding, replication into other DBMSs, ...) For reasons why we think this is an important set of features please check out the presentation from the in-core replication summit at pgcon: http://wiki.postgresql.org/wiki/File:BDR_Presentation_PGCon2012.pdf While you may argue that most of the above design goals are already provided by various trigger based replication solutions like Londiste or Slony, we think that thats not enough for various reasons: - not in core (and thus less trustworthy) - duplication of writes due to an additional log - performance in general (check the end of the above presentation) - complex to use because there is no native administration interface We want to emphasize that this proposed architecture is based on the experience of developing a minimal prototype which we developed with the above goals in mind. While we obviously hope that a good part of it is reusable for the community we definitely do *not* expect that the community accepts this +as-is. It is intended to be the basis upon which we, the community, can build and design the future logical replication. === Basic architecture === : Very broadly speaking there are several major pieces common to most approaches to replication: 1. Source data generation 2. Transportation of that data 3. Applying the changes 4. Conflict resolution 1.: As we need a change stream that contains all required changes in the correct order, the requirement for this stream to reflect changes across multiple concurrent backends raises concurrency and scalability issues. Reusing the WAL stream for this seems a good choice since it is needed anyway and adresses those issues already, and it further means that we don't incur duplicate writes. Any other stream generating componenent would introduce additional scalability issues. We need a change stream that contains all required changes in the correct order which thus needs to be synchronized across concurrent backends which introduces obvious concurrency/scalability issues. Reusing the WAL stream for this seems a good choice since it is needed anyway and adresses those issues already, and it further means we don't duplicate the writes and locks already performance for its maintenance. Unfortunately, in this case, the WAL is mostly a physical representation of the changes and thus does not, by itself, contain the necessary information in a convenient format to create logical changesets. The biggest problem is, that interpreting tuples in the WAL stream requires an up-to-date system catalog and needs to be done in a compatible backend and architecture. The requirement of an up-to-date catalog could be solved by adding more data to the WAL stream but it seems to be likely that that would require relatively intrusive & complex changes. Instead we chose to require a synchronized catalog at the decoding site. That adds some complexity to use cases like replicating into a different database or cross-version replication. For those it is relatively straight-forward to develop a proxy pg instance that only contains the catalog and does the transformation to textual changes. This also is the solution to the other big problem, the need to work around architecture/version specific binary formats. The alternative, producing cross-version, cross-architecture compatible binary changes or even moreso textual changes all the time seems to be prohibitively expensive. Both from a cpu and a storage POV and also from the point of implementation effort. The catalog on the site where changes originate can *not* be used for the decoding because at the time we decode the WAL the catalog may have changed from the state it was in when the WAL was generated. A possible solution for this would be to have a fully versioned catalog but that again seems to be rather complex and intrusive. For some operations (UPDATE, DELETE) and corner-cases (e.g. full page writes) additional data needs to be logged, but the additional amount of data isn't that big. Requiring a primary-key for any change but INSERT seems to be a sensible thing for now. The required changes are fully contained in heapam.c and are pretty simple so far. 2.: For transport of the non-decoded data from the originating site to the decoding site we decided to reuse the infrastructure already provided by walsender/walreceiver. We introduced a new command that, analogous to START_REPLICATION, is called START_LOGICAL_REPLICATION that will stream out all xlog records that pass through a filter. The on-the-wire format stays the same. The filter currently simply filters out all record which are not interesting for logical replication (indexes, freezing, ...) and records that did not originate on the same system. The requirement of filtering by 'origin' of a wal node comes from the planned multimaster support. Changes replayed locally that originate from another site should not replayed again there. If the wal is plainly used without such a filter that would cause loops. Instead we tag every wal record with the "node id" of the site that caused the change to happen and changes with a nodes own "node id" won't get applied again. Currently filtered records get simply replaced by NOOP records and loads of zeroes which obviously is not a sensible solution. The difficulty of actually removing the records is that that would change the LSNs. We currently rely on those though. The filtering might very well get expanded to support partial replication and such in future. 3.: To sensibly apply changes out of the WAL stream we need to solve two things: Reassemble transactions and apply them to the target database. The logical stream from 1. via 2. consists out of individual changes identified by the relfilenode of the table and the xid of the transaction. Given (sub)transactions, rollbacks, crash recovery, subtransactions and the like those changes obviously cannot be individually applied without fully loosing the pretence of consistency. To solve that we introduced a module, dubbed ApplyCache which does the reassembling. This module is *independent* of the data source and of the method of applying changes so it can be reused for replicating into a foreign system or similar. Due to the overhead of planner/executor/toast reassembly/type conversion (yes, we benchmarked!) we decided against statement generation for apply. Even when using prepared statements the overhead is rather noticeable. Instead we decided to use relatively lowlevel heapam.h/genam.h accesses to do the apply. For now we decided to use only one process to do the applying, parallelizing that seems to be too complex for an introduction of an already complex feature. In our tests the apply process could keep up with pgbench -c/j 20+ generating changes. This will obviously heavily depend on the workload. A fully seek bound workload will definitely not scale that well. Just to reiterate: Plugging in another method to do the apply should be a relatively simple matter of setting up three callbacks to a different function (begin, apply_change, commit). Another complexity in this is how to synchronize the catalogs. We plan to use command/event triggers and the oid preserving features from pg_upgrade to keep the catalogs in-sync. We did not start working on that. 4.: While we started to think about conflict resolution/avoidance we did not start to work on it. We currently *cannot* handle conflicts. We think that the base features/architecture should be aggreed uppon before starting with it. Multimaster tests were done with sequences setup with INCREMENT 2 and different start values on the two nodes. === Current Prototype === The current prototype consists of a series of patches that are split in hopefully sensible and coherent parts to make reviewing of individual parts possible. Its also available in the 'cabal-rebasing' branch on git.postgresql.org/users/andresfreund/postgres.git . That branch will modify history though. 01: wakeup handling: reduces replication lag, not very interesting in this context 02: Add zeroRecPtr: not very interesting either 03: new syscache for relfilenode. This would benefit by some syscache experienced eyes 04: embedded lists: This is a general facility, general review appreciated 05: preliminary bgworker support: This is not ready and just posted as its preliminary work for the other patches. Simon will post a real patch soon 06: XLogReader: Review definitely appreciated 07: logical data additions for WAL: Review definitely appreciated, I do not expect fundamental changes 08: ApplyCache: Important infrastructure for the patch, review definitely appreciated 09: Wal Decoding: Decode WAL generated with wal_level=logical into an ApplyCache 10: WAL with 'origin node': This is another important base-piece for logical rep 11: WAL segment handling changes: If the basic idea of adding a node_id to the functions and adding a pg_lcr directory is acceptable the rest of the patch is fairly boring/mechanical 12: walsender/walreceiver changes: Implement transport/filtering of logical changes. Very relevant 13: shared memory/crash recovery state handling for logical rep: Very relevant minus the TODO's in the commit message 14: apply module: review appreciated 15: apply process: somewhat dependent on the preliminary changes in 05, general direction is visible, loads of detail work needed as soon as some design decisions are agreed uppon. 16: this document. Not very interesting after youve read it ;) -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers