Re: [DOCS] Replication - super-draft

Markus Schiltknecht Tue, 25 Jul 2006 07:56:38 -0700

Hello Chris,

thank you for putting this together. It seems to be the replicationguide we are looking for: an overview over the principles and solutionsof replication for PostgreSQL. Good work!

Some suggestions I came up with when reading: when preparing mypresentation I came to the conclusion that first introducing 'single- vsmulti-master' before 'sync vs async' is easier. Especially because ofthe 'after-commit-conflicting-transactions' problem that rises in async,multi-master solutions, YMMV.

Being the Postgres-R (8) guy, I of course have some additions to thatparagraph: You mention Postgres-R being only usable on rather oldreleases. Please also mention that I'm actively working on a productiongrade Postgres-R for an up-to-date PostgreSQL version. There already isa downloadable, working prototype.

Concerning Slony-II: AFAIK there is no mechanism to detect conflictsearlier for normal transactions (maybe for DDL or TABLE LOCKS orsomething, but mentioning that here is misleading). Such a distributedlocking mechanism would increase network traffic - and decrease totalperformance.

Because Slony-II and Postgres-R are that similar, they also share the'problems that emerged'. Because Slony-II and Postgres-R are coceptuallymostly identical, I'd propose to merge those two paragraphs. You mightwant to add that both projects were created independently, but theirauthors currently plan to join their work and efforts.

A nit-picking detail: I'm unsure about the correct english spelling ofsingle- and multi-master ('single master', 'single-master' or'singlemaster'), but please make sure to use a consistent spelling.


Regrards

Markus

Chris Browne wrote:

Here is a "real drafty beginning" to a set of docs on replication.

Bruce was soliciting this at the Code Sprint...  Nobody really promised it....

Here's a beginning to it.  It's based on my impressions; I don't care
how much it gets "hacked up" from here.  Hopefully it will be quicker
to start with this than to start from the beginning with something
else.


<chapter id="replication">
 <title> Replication </title>

  <indexterm><primary>replication</primary></indexterm>

  <para> People frequently ask about what replication options are
  available for <productname>PostgreSQL</productname>.
  Unfortunately, there are so many approaches and models to this
  that are useful for different purposes that things tend to get
  confusing.
  </para>

  <para> At perhaps the most primitive level, one might
  use <xref linkend="backup"> tools,
  whether <xref linkend="app-pgdump">
  or <xref linkend="continuous-archiving"> to create additional
  copies of databases.  This <emphasis>doesn't</emphasis> provide
  any way to keep the replicas up to date; to bring the state of
  things to a different point in time requires bringing up another
  copy.  There is no way, with these tools, for updates on
  a <quote>master</quote> system to automatically propagate to the
  replicas.</para>

<sect1> <title> Categorization of Replication Systems </title>

  <para> Looking at replication systems, there are a number of ways in
  which they may be viewed:

  <itemizedlist>

  <listitem><para> Synchronous versus asynchronous</para>

  <para>Synchronous systems are ones where updates must be
  accepted on all the databases before they are permitted
  to <command>COMMIT</command>. </para>

  <para> Asynchronous systems propagate updates to the other
  databases later.  This permits the possibility that one
  database may have data significantly behind others.  Whether
  or not being behind is acceptable or not will depend on the
  nature of the application.</para>
  </listitem>

  <listitem><para> Single master versus multimaster.</para>

  <para> That is, whether there is a single database
  considered <quote>master</quote>, where all update operations
  are required to be submitted, or the alternative, multimaster,
  where updates may be submitted to any of several
  databases.</para>

  <para> Multimaster replication is vastly more complex and
  expensive, because of the need to deal with the possibility of
  conflicting updates.  The simplest example of this is where a
  replicated database manages inventory; the question is, what
  happens when requests go to different database nodes
  requesting ` a particular piece of inventory?</para>

  <para> Synchronous multimaster replication introduces the need
  to distribute locks across the systems, which, in research work
  done with Postgres-R and Slony-II, has proven to be very
  expensive. </para>

  <para> Asynchronous multimaster replication introduces the
  possibility that conflicting updates will be accepted by
  multiple nodes, as they don't know, at <command>COMMIT</command>
  time, that the updates conflict.  It is then necessary to have
  some sort of conflict resolution system, which can't really be
  generalized as a generic database facility.  An instance of this
  that is commonly seen is in the <productname>PalmOS
  HotSync</productname> system; the <quote>general policy</quote>
  when conflicts are noticed is to allow both conflicting records
  to persist until a human can intervene.  That may be quite
  acceptable for an address book; it's <emphasis>not</emphasis>
  fine for OLTP systems. </para>

  </listitem>

  <listitem><para> Update capture methods  </para>

  <para> Common methods include having triggers on tables,
  capturing SQL statements, and capturing transaction log (WAL)
  updates </para>

  <itemizedlist>

  <listitem><para> Triggers, as used in eRServer and Slony-I,
  have the advantage of capturing updates at the end of
  processing when all column values have been finalized.  The
  use of transaction visibility (MVCC) and ordering can provide
  strong guarantees on consistency. </para>

<para> Of course, firing a trigger for each tuple update

    comes at a not inconsiderable cost: a statement that touches
    10,000 tuples will fire the trigger 10,000 times, and
      transform, on the subscriber, into 10,000 SQL
      statements.</para></listitem>

<listitem><para> Statement capture almost exactly reverses the

      issues, as compared to triggers.</para>

      <para> There are no strong guarantees on consistency: any
      sort of nondeterministic query can <quote>corrupt</quote>
      things by introducing differences between nodes.  Here are
      four examples of cases where naive statement capture is sure
      to get things wrong:</para>

      <itemizedlist>
        <listitem><para><command>INSERT INTO mytable (txntime, product, quantity, taxes, 
total) values (now(), 'AB-275', 10, 45, 250.00);</command></para>
        <para> Some replication systems parse the queries, replacing date requests 
with timestamps. </para>
        </listitem>
        <listitem><para><command>INSERT INTO table2 (random() *
          50);</command></para>
          <para> In this case, nondeterminism is fairly much the
          point!</para>
            </listitem>
            <listitem><para>Any use of sequnce values as defaults,
            particularly with per-connection value cacheing, will open
            up occasions for values to diverge between
            nodes.</para></listitem>
              <listitem><para><command>INSERT INTO tab1 (txn_type,
              tdate, quantity, units, price) SELECT * FROM tab2 ORDER BY
              txn_type limit 50;</command></para>

              <para> There are many variations on this which will turn
                out badly: </para>
                <itemizedlist>
                 <listitem><para>If there are default fields in tab1
                  that are set using sequences, the only way to even
                   hope for the same ordering is to have
                    an <command>ORDER BY</command> clause that ensures
                   identical ordering on both hosts.</para></listitem>
                     <listitem><para> If the ordering isn't a suitable
                      total ordering, the requests for data from tab2 may
                        find different data on different
                         hosts.</para></listitem>
                        <listitem><para>Columns with a default
                          of <function>now()</function> will be troublesome as
                           mentioned earlier, and this makes the problem harder
                             because unlike in the earlier query, where one 
might
                             substitute '2006-09-02 04:42:23-00'
                              for <function>now()</function>, this requires a
                              substantial rewriting of the 
query.</para></listitem>
                                </itemizedlist>
                                </listitem>
                                </itemizedlist>

                                </listitem>
  </itemizedlist>

  </listitem>

  </itemizedlist>

  </para>

</sect1>

<sect1> <title> PostgreSQL Replication Systems and Their Uses </title>

<para> Based on the preceding taxonomy, we may categorize various
replication systems, which should be helpful in determining what they
may be best used for, and whether they are compatible with
your <quote>use case.</quote></para>

<sect2><title> Slony-I</title>

<para> Slony-I is a single-master to multiple subscriber asynchronous
replication system that captures updates using triggers. </para>

<para> For many systems, it is not clear how to initialize replication
on a new node some time after a system has been set up in production.
Slony-I was specifically designed to provide the ability to introduce
new nodes without the need to interrupt activity on the master
node.  </para>

<para> It has, a particular merit, that, by only using components
internal to PostgreSQL, it is compatible with multiple versions of
PostgreSQL. This lends it especially to assisting at upgrading systems
from one version of PostgreSQL to another without requiring a long
outage. </para>

<para> It suffers from three particular problems:</para>

<itemizedlist>
<listitem><para> Despite improvements from earlier versions, it is fairly complex to 
configure and administer.</para></listitem>
<listitem><para> It can only replicate changes that can be captured using triggers. 
</para>
<para> There is a handling for sequences, which comes via polling, but Slony-I 
<emphasis>does not</emphasis> provide an automatic way to replicate other sorts of objects. 
</listitem>
<listitem><para> The handling of DDL changes is somewhat fragile, and exists as 
something of a bag on the side. </para>
<para> There has been loose discussion as to how to address that; useful 
comprehensive answers have not emerged.
</listitem>
</itemizedlist>

</sect2>

<sect2><title> pgpool </title>

  <para> <application>pgpool</application> was initially created by
  Tatsuo Isshii as a portable alternative to Java connection pool
  modules.  He subsequently observed that it wouldn't take very much
  effort to extend it to create a simple replication system: if it is
  forwarding SQL queries to a PostgreSQL instance, extending that to
  two databases is very straightforward. </para>

  <para> It suffers, by nature, from the problems associated with
  replicating using capture of SQL statements; any sort of
  nondeterminism in the replicated statements will cause the databases
  to diverge. </para>

  <para> On the other hand, it is very easy to install and configure;
  for users with simple requirements, that can suffice. </para>

  <para> A <application>pgpool-2</application> is under way which
  introduces a more sophisticated query parser to try to address the
  nondeterminism issues; that may limit ongoing support for the legacy
  version.</para>

</sect2>

<sect2> <title> PITR - Point In Time Recovery </title>

<para> If you have a database cluster that supports a large number of
database instances (<emphasis>e.g.</emhasis> - varying values for
PGDATABASE), connection-managing systems like pgpool and systems like
Slony-I which require a manager process for each database for each
node that is replicated will turn out quite badly.</para>

<para> For instance, if you have a database cluster that hosts 300
databases, as would be the case in a "web hosting" situation, for
Slony-I to replicate all of this data, it would have to have 300 slon
processes for each node.  </para>

<para> PITR is likely to be more suitable in this case; that doesn't
provide you with a usable replica running, but it can recover *all* of
the tables in *all* of the databases on the backend.</para>
 <sect2> <title> Postgres-R </title>

  <para> This has been a research project at McGill University,
  building a multimaster synchronous replication system which uses a
  group communications system (e.g. - Spread) to control propagation
  of update requests, which it captures via
  adding <quote>hooks</quote> to the database engine to detect
  changes. </para>

  <para> Being a research project, the key has been to learn about
  replication as opposed to provide a <quote> production
  grade </quote> replication system.  For a considerable period of
  time it was only at all usable on rather old releases of
  PostgreSQL. </para>

  <para> The handling of DDL changes has long been somewhat
  controversial; several attempts to implement DDL handlers have been
  made, none of which has <quote>stuck.</quote> </para>

</sect2>

<sect2> <title> Slony-II </title>

  <para> This project inherited directly from Postgres-R, with an
  intent to create a multimaster synchronous replication system atop a
  group communications system, but then to proceed to something more
  of <quote>production grade</quote>. </para>

  <para> The notable distinction from Postgres-R was that, in order to
  find conflicts earlier, and to diminish the amount of work needing
  to be done at the synchronization point, Slony-II would try to
  publish and promote lock requests as soon as possible. </para>

  <para> Unfortunately several problems emerged: </para>

  <itemizedlist>

  <listitem><para> The available open source group communications
  systems turn out to neither be fast enough nor reliable enough for
  the purpose. </para></listitem>

  <listitem><para> One of the goals was for there to be as little
  need as possible to modify applications to deal with
  replication. </para>

  <para> Unfortunately, there turn out to be some cases where
  competing updates (e.g. - for updates to account balances) would
  cause multimaster replication to reject transactions due to
  concurrency problems with high frequency. </para>
  </listitem>

  </itemizedlist>

  <para> Unless some fundamentally better group communications system
  emerges, it is unlikely that Slony-II can progress further any time
  soon. </para>

</sect2>

<sect2> <title> pgcluster </title>

<para> Nothing to say yet... </para> </sect2>
</sect1>


---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

Re: [DOCS] Replication - super-draft

Reply via email to