Theo Schlossnagle wrote:
On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
Obviously the counters will immediately drift apart based on the
transaction load of the nodes as soon as the network goes down. And in
order to avoid this "clock" confusion and wrong expectation, you'd
rather have a system with such a simple, non-clock based counter and
accept that it starts behaving totally wonky when the cluster
reconnects after a network outage? I rather confuse a few people than
having a last update wins conflict resolution that basically rolls
dice to determine "last".
If your cluster partition and you have hours of independent action and
upon merge you apply a conflict resolution algorithm that has enormous
effect undoing portions of the last several hours of work on the nodes,
you wouldn't call that "wonky?"
You are talking about different things. Async replication, as Jan is
planning to do, is per se "wonky", because you have to cope with
conflicts by definition. And you have to resolve them by late-aborting a
transaction (i.e. after a commit). Or put it another way: async MM
replication means continuing in disconnected mode (w/o quorum or some
such) and trying to reconciliate later on. It should not matter if the
delay is just some milliseconds of network latency or three days (except
of course that you probably have more data to reconciliate).
For sane disconnected (or more generally, partitioned) operation in
multi-master environments, a quorum for the dataset must be
established. Now, one can consider the "database" to be the dataset.
So, on network partitions those in "the" quorum are allowed to progress
with data modification and others only read.
You can do this to *prevent* conflicts, but that clearly belongs to the
world of sync replication. I'm doing this in Postgres-R: in case of
network partitioning, only a primary partition may continue to process
writing transactions. For async replication, it does not make sense to
prevent conflicts when disconnected. Async is meant to cope with
conflicts. So as to be independent of network latency.
However, there is no
reason why the dataset _must_ be the database and that multiple datasets
_must_ share the same quorum algorithm. You could easily classify
certain tables or schema or partitions into a specific dataset and apply
a suitable quorum algorithm to that and a different quorum algorithm to
other disjoint data sets.
I call that partitioning (among nodes). And it's applicable to sync as
well as async replication, while it makes more sense in sync replication.
What I'm more concerned about, with Jan's proposal, is the assumption
that you always want to resolve conflicts by time (except for balances,
for which we don't have much information, yet). I'd rather say that time
does not matter much if your nodes are disconnected. And (especially in
async replication) you should prevent your clients from committing to
one node and then reading from another, expecting to find your data
there. So why resolve by time? It only makes the user think you could
guarantee that order, but you certainly cannot.
---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at