On Feb 4, 2007, at 10:06 AM, Jan Wieck wrote:

On 2/4/2007 3:16 AM, Peter Eisentraut wrote:
Jan Wieck wrote:
This is all that is needed for last update wins resolution. And as
said before, the only reason the clock is involved in this is so that
nodes can continue autonomously when they lose connection without
conflict resolution going crazy later on, which it would do if they
were simple counters. It doesn't require microsecond synchronized
clocks and the system clock isn't just used as a Lamport timestamp.
Earlier you said that "one assumption is that all servers in the multimaster cluster are ntp synchronized", which already rang alarm bells for me. Now that I read this, you appear to require synchronization not on the microsecond level, but on some level nonetheless. I think that would be pretty hard for an administrator to manage, seeing that NTP typically cannot provide such guarantees.

Synchronization to some degree is wanted to avoid totally unexpected behavior. The conflict resolution algorithm itself can live perfectly well with plain counters, but you probably wouldn't want the result. Say you update a record on one node, then 10 minutes later you update the same record on another node. Unfortunately, the nodes had no communication in between, and because the first node is much busier, its counter is way ahead ... so the later update would get lost in conflict resolution when the nodes reestablish communication. They would have the same data at the end, just not what any sane person would expect.
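To make the anomaly concrete, here is a minimal sketch (not code from any actual system; all names are illustrative) of last-update-wins resolution over plain per-node counters, showing how a busier node's advanced counter makes the genuinely later update lose:

```python
def resolve(a, b):
    """Last-update-wins: pick the version with the higher (counter, node).
    The node id is only a tie-breaker for equal counters."""
    return a if (a["counter"], a["node"]) >= (b["counter"], b["node"]) else b

# Node A is busy: its counter has raced ahead to 5000.
update_a = {"node": "A", "counter": 5000, "value": "old"}

# Ten minutes later (wall clock) node B writes the same row,
# but B has been idle, so its counter is only 42.
update_b = {"node": "B", "counter": 42, "value": "new"}

winner = resolve(update_a, update_b)
# The stale value survives, even though B's update happened later.
assert winner["value"] == "old"
```

Both nodes converge on the same row, which is all the algorithm itself guarantees; it is only the "last update" *meaning* that is lost.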

This behavior kicks in whenever conflicting cross-node updates happen close enough together that the time difference between the clocks can affect the outcome. So if you update the logically same row on two nodes within a tenth of a second, and the clocks are more than that apart, conflict resolution can result in the older row surviving. Clock synchronization is simply used to minimize this window.

The system clock is used only to keep the counters somewhat synchronized in the case of connection loss to retain some degree of "last update" meaning. Without that, continuing autonomously during a network outage is just not practical.
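One way to picture "counters kept somewhat synchronized by the system clock" is a counter that never falls behind wall-clock seconds. This is a sketch under my own assumptions (the mail doesn't give an algorithm), along the lines of a hybrid logical/physical clock:

```python
import time

class HybridCounter:
    """Illustrative only: a per-node counter that always advances, but
    jumps forward to the system clock when the clock is ahead.  During
    a partition, each node's counter stays near real time, so 'last
    update wins' keeps rough wall-clock meaning when nodes reconnect."""

    def __init__(self):
        self.counter = 0

    def next(self):
        # Advance by at least one; catch up to the system clock
        # (whole seconds) if it has moved past the counter.
        self.counter = max(self.counter + 1, int(time.time()))
        return self.counter
```

A busy node can still run its counter ahead within a second, but it can no longer race minutes ahead of an idle peer.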

A Lamport clock addresses this. It relies on a cluster-wide clock tick. While it could be based on the system clock, it would not be based on more than one clock. The point of the Lamport clock is that there is _a_ clock, not multiple ones.
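For reference, the classic Lamport logical clock (the standard algorithm, not anything from this thread) is just a counter merged on every message, which is what gives the cluster one clock instead of many:

```python
class LamportClock:
    """Classic Lamport timestamps: increment on every local event,
    and on receive jump past the sender's clock so that causally
    later events always carry larger timestamps."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def on_receive(self, remote_time):
        # Merge: never stand still, never fall behind the sender.
        self.time = max(self.time, remote_time) + 1
        return self.time
```

Note that plain Lamport timestamps order events consistently with causality, but say nothing about wall-clock time, which is exactly the property being traded off in this discussion.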

One concept is to have a universal clock that ticks forward (say, every second), with each node ordering all of its transactions inside that second-granular tick. Each commit would then be identified as {node, clocksecond, txn#}, and each time the clock ticks forward, txn# is reset to zero. This gives you ordered transactions, windowed within some cluster-wide acceptable window (1 second). However, driving this off NTP is totally broken, as NTP is entirely insufficient for the purpose because of a variety of forms of clock skew. Instead, the timestamp should be incremented via cluster consensus (a token ring, or a pulse generated by the leader of the current cluster membership quorum).
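The {node, clocksecond, txn#} scheme above can be sketched as follows. This is a toy illustration under the stated assumption that the tick arrives via cluster consensus (the consensus machinery itself is not shown); commits compare as (clocksecond, txn#, node) tuples:

```python
from dataclasses import dataclass

@dataclass
class CommitStamper:
    """Per-node commit stamping for a cluster-wide, consensus-driven
    one-second tick.  txn# orders commits within a tick and resets
    to zero whenever the tick advances."""
    node: str
    clocksecond: int = 0
    txn: int = 0

    def tick(self, new_second):
        # Delivered by cluster consensus (token ring / quorum leader pulse).
        if new_second > self.clocksecond:
            self.clocksecond = new_second
            self.txn = 0

    def stamp(self):
        self.txn += 1
        # (clocksecond, txn#, node) sorts commits cluster-wide,
        # windowed to one-second granularity.
        return (self.clocksecond, self.txn, self.node)

stamper = CommitStamper("A")
stamper.tick(100)
first = stamper.stamp()    # (100, 1, "A")
second = stamper.stamp()   # (100, 2, "A")
stamper.tick(101)          # tick advances, txn# resets
third = stamper.stamp()    # (101, 1, "A")
assert first < second < third
```

Because the tick comes from consensus rather than any node's system clock, no node's busyness or clock skew can push its commits ahead of the cluster-wide window.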

As the clock must be incremented cluster-wide, the need for it to be in sync with the system clock (on any or all of the systems) is obviated. In fact, since you can't guarantee that synchronicity, a time-based clock can be confusing -- one expects a time-based clock to be accurate to the time. A counter-based clock carries no such expectation.

// Theo Schlossnagle
// CTO -- http://www.omniti.com/~jesus/
// OmniTI Computer Consulting, Inc. -- http://www.omniti.com/


