Re: [HACKERS] Proposal: Commit timestamp

Markus Schiltknecht Mon, 05 Feb 2007 23:37:12 -0800

Hi,

Theo Schlossnagle wrote:

On Feb 4, 2007, at 1:36 PM, Jan Wieck wrote:
Obviously the counters will immediately drift apart based on the transaction load of the nodes as soon as the network goes down. And in order to avoid this "clock" confusion and wrong expectation, you'd rather have a system with such a simple, non-clock based counter and accept that it starts behaving totally wonky when the cluster reconnects after a network outage? I'd rather confuse a few people than have a last-update-wins conflict resolution that basically rolls dice to determine "last".
If your cluster partitions and you have hours of independent action, and upon merge you apply a conflict resolution algorithm that has the enormous effect of undoing portions of the last several hours of work on the nodes, you wouldn't call that "wonky"?

You are talking about different things. Async replication, as Jan is planning to do, is per se "wonky", because you have to cope with conflicts by definition. And you have to resolve them by late-aborting a transaction (i.e. after a commit). Or put another way: async MM replication means continuing in disconnected mode (w/o quorum or some such) and trying to reconcile later on. It should not matter whether the delay is just some milliseconds of network latency or three days (except, of course, that you probably have more data to reconcile).

For sane disconnected (or more generally, partitioned) operation in multi-master environments, a quorum for the dataset must be established. Now, one can consider the "database" to be the dataset. So, on network partitions those in "the" quorum are allowed to progress with data modification and others only read.

You can do this to *prevent* conflicts, but that clearly belongs to the world of sync replication. I'm doing this in Postgres-R: in case of network partitioning, only a primary partition may continue to process writing transactions. For async replication, it does not make sense to prevent conflicts when disconnected. Async is meant to cope with conflicts, so as to be independent of network latency.
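To make the "primary partition" idea concrete, here is a minimal sketch. It assumes a strict-majority rule decides which partition is primary; that rule and the names has_quorum and allowed_operations are illustrative assumptions, not a description of what Postgres-R actually implements.

```python
def has_quorum(reachable_nodes, all_nodes):
    """Return True if the reachable nodes form a strict majority of the cluster.

    Assumption: "primary partition" means strict majority. A real system may
    use weights, a tie-breaker node, or a different membership protocol.
    """
    members = set(all_nodes)
    reachable = set(reachable_nodes) & members
    return len(reachable) * 2 > len(members)


def allowed_operations(reachable_nodes, all_nodes):
    """Nodes in the primary partition may read and write; all others only read."""
    if has_quorum(reachable_nodes, all_nodes):
        return {"read", "write"}
    return {"read"}


cluster = ["a", "b", "c", "d", "e"]
majority_side = allowed_operations(["a", "b", "c"], cluster)
minority_side = allowed_operations(["d", "e"], cluster)
```

With five nodes split three to two, only the three-node side keeps accepting writes. An even split of a four-node cluster leaves no side with a strict majority, so under this rule neither side may write.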

However, there is no reason why the dataset _must_ be the database and that multiple datasets _must_ share the same quorum algorithm. You could easily classify certain tables or schemas or partitions into a specific dataset and apply a suitable quorum algorithm to that and a different quorum algorithm to other disjoint datasets.

I call that partitioning (among nodes). And it's applicable to sync as well as async replication, though it makes more sense in sync replication.
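A rough sketch of such per-dataset quorum rules, purely for illustration: the table names, node names, and the two rules (majority_of and requires_node) are hypothetical and were not proposed by anyone in this thread.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# A quorum rule decides, from the set of reachable nodes, whether writes are allowed.
QuorumRule = Callable[[FrozenSet[str]], bool]


def majority_of(members: Iterable[str]) -> QuorumRule:
    """Writes allowed only if a strict majority of `members` is reachable."""
    member_set = frozenset(members)

    def rule(reachable: FrozenSet[str]) -> bool:
        return len(reachable & member_set) * 2 > len(member_set)

    return rule


def requires_node(owner: str) -> QuorumRule:
    """Writes allowed only in the partition that contains the owning node."""

    def rule(reachable: FrozenSet[str]) -> bool:
        return owner in reachable

    return rule


# Hypothetical classification of tables into disjoint datasets.
TABLE_TO_DATASET: Dict[str, str] = {
    "accounts": "global",
    "eu_orders": "eu",
}

# Each dataset gets its own quorum rule.
DATASET_RULES: Dict[str, QuorumRule] = {
    "global": majority_of({"n1", "n2", "n3"}),
    "eu": requires_node("n2"),
}


def may_write(table: str, reachable_nodes: Iterable[str]) -> bool:
    """Check whether the partition seeing `reachable_nodes` may modify `table`."""
    rule = DATASET_RULES[TABLE_TO_DATASET[table]]
    return rule(frozenset(reachable_nodes))
```

If the network splits into {n1, n3} and {n2}, the first partition may still write "accounts" while the second may still write "eu_orders". Each rule admits at most one writing partition per dataset, so no conflicts arise within a dataset.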

What I'm more concerned about, with Jan's proposal, is the assumption that you always want to resolve conflicts by time (except for balances, for which we don't have much information yet). I'd rather say that time does not matter much if your nodes are disconnected. And (especially in async replication) you should prevent your clients from committing to one node and then reading from another, expecting to find their data there. So why resolve by time? It only makes the user think you could guarantee that order, but you certainly cannot.
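A small sketch of why resolving by time cannot guarantee that order. It assumes each node stamps commits with its own local clock and that the clocks are skewed by a couple of seconds; the Version type, the offsets, and the node-name tie-breaker are made up for illustration and are not part of Jan's proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    node: str
    value: str
    true_time: float    # real commit time; no node can actually observe this
    local_stamp: float  # what the node's own clock recorded at commit


def commit(node: str, value: str, true_time: float, clock_offset: float) -> Version:
    """Stamp a commit with the node's local clock (true time plus its skew)."""
    return Version(node, value, true_time, true_time + clock_offset)


def last_update_wins(a: Version, b: Version) -> Version:
    """Pick the version with the larger local timestamp; node name breaks ties."""
    return a if (a.local_stamp, a.node) >= (b.local_stamp, b.node) else b


# Node A's clock runs 2 s fast, node B's clock runs 1 s slow (assumed skew).
first = commit("A", "old", true_time=100.0, clock_offset=+2.0)
second = commit("B", "new", true_time=101.0, clock_offset=-1.0)

winner = last_update_wins(first, second)
```

Here the update on node B really happened later, yet the version from node A wins, because only the local stamps are visible to the resolver.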


Regards

Markus

