On Fri, Nov 13, 2015 at 1:16 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On 11 November 2015 at 09:22, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
>> 1. Reader waits with exposed LSNs, as Heikki suggests. This is what
>> BerkeleyDB does in "read-your-writes" mode. It means that application
>> developers have the responsibility for correctly identifying transactions
>> with causal dependencies and dealing with LSNs (or whatever equivalent
>> tokens), potentially even passing them to other processes where the
>> transactions are causally dependent but run by multiple communicating
>> clients (for example, communicating microservices). This makes it
>> difficult to retrofit load balancing to pre-existing applications and (like
>> anything involving concurrency) difficult to reason about as applications
>> grow in size and complexity. It is efficient if done correctly, but it is
>> a tax on application complexity.
> Agreed. This works if you have a single transaction connected thru a pool
> that does statement-level load balancing, so it works in both session and
> transaction mode.
> I was in favour of a scheme like this myself, earlier, but have more
> thoughts now.
> We must also consider the need for serialization across sessions or
> In transaction pooling mode, an application could get assigned a different
> session, so a token would be much harder to pass around.
>> 2. Reader waits for a conservatively chosen LSN. This is roughly what
>> MySQL derivatives do in their "causal_reads = on" and "wsrep_sync_wait =
>> 1" modes. Read transactions would start off by finding the current end
>> of WAL on the primary, since that must be later than any commit that
>> already completed, and then waiting for that to apply locally. That means
>> every read transaction waits for a complete replication lag period,
>> potentially unnecessarily. This is a tax on readers with unnecessary waiting.
> This tries to make it easier for users by forcing all users to experience
> a causality delay. Given the whole purpose of multi-node load balancing is
> performance, referencing the master again simply defeats any performance
> gain, so you couldn't ever use it for all sessions. It could be a USERSET
> parameter, so could be turned off in most cases that didn't need it. But
> it's easier to use than (1).
> Though this should be implemented in the pooler.
>> 3. Writer waits, as proposed. In this model, there is no tax on readers
>> (they have zero overhead, aside from the added complexity of dealing with
>> the possibility of transactions being rejected when a standby falls behind
>> and is dropped from 'available' status; but database clients must already
>> deal with certain types of rare rejected queries/failures such as
>> deadlocks, serialization failures, server restarts etc). This is a tax on
> This would seem to require that all readers must first check with the
> master as to which standbys are now considered available, so it looks like
No -- in (3), that is this proposal, standbys don't check with the primary
when you run a transaction. Instead, the primary sends a constant stream
of authorizations (in the form of keepalives sent every
causal_reads_timeout / 2 in the current patch) to the standby, allowing it
to consider itself available for a short time into the future (currently
now + causal_reads_timeout - max_tolerable_clock_skew to be specific -- I
can elaborate on that logic in a separate email). At the start of a
transaction in causal reads mode (the first call to GetTransaction to be
specific), the standby knows immediately without communicating with the
primary whether it can proceed or must raise the error. In the happy case,
the reader simply compares the most recently received authorization's
expiry time with the system clock and proceeds. In the worst case, when
contact is lost between primary and standby, the primary must stall
causal_reads commits for causal_reads_timeout (see CausalReadsBeginStall).
Doing that makes sure that no causal reads commit can return (see
CausalReadsCommitCanReturn) before the lost standby has definitely started
raising the error for causal_reads queries (because its most recent
authorization has expired), in case it is still alive and handling requests
It is not at all like (2), which introduces a conservative wait at the
start of every read transaction, slowing all readers down. In (3), readers
don't wait: they run (or are rejected) as fast as possible, and the primary
does the extra work instead. Hence my categorization of (2) as a 'tax
on readers', and of (3) as a 'tax on writers'. The idea is that a site
with a high ratio of reads to writes would prefer zero-overhead reads.
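
To make the authorization scheme described above concrete, here is a
minimal sketch in Python. It is not the patch's code: the class and method
names are hypothetical, clocks are passed in as plain numbers so the logic
can be followed by hand, and I am assuming the expiry is computed from the
primary's clock at the moment the keepalive is sent.

```python
from dataclasses import dataclass


@dataclass
class Primary:
    """Hypothetical model of the primary's side of the scheme."""
    causal_reads_timeout: float       # seconds
    max_tolerable_clock_skew: float   # seconds
    stall_until: float = 0.0          # causal reads commits wait until this time

    def keepalive_interval(self) -> float:
        # Authorizations are sent every causal_reads_timeout / 2.
        return self.causal_reads_timeout / 2

    def make_authorization(self, now: float) -> float:
        # Expiry carried by a keepalive sent at primary time `now`.
        return now + self.causal_reads_timeout - self.max_tolerable_clock_skew

    def begin_stall(self, now: float) -> None:
        # Contact with a standby was lost at `now`: hold back commits for
        # causal_reads_timeout, by which point the standby's last
        # authorization must have expired even with maximum clock skew.
        self.stall_until = max(self.stall_until,
                               now + self.causal_reads_timeout)

    def commit_can_return(self, now: float) -> bool:
        return now >= self.stall_until


@dataclass
class Standby:
    """Hypothetical model of the standby's side of the scheme."""
    lease_expiry: float = float("-inf")  # no authorization received yet

    def receive_authorization(self, expiry: float) -> None:
        # Keep the latest expiry seen; older keepalives never shorten it.
        self.lease_expiry = max(self.lease_expiry, expiry)

    def can_start_causal_read(self, local_now: float) -> bool:
        # Purely local check: no round trip to the primary.
        return local_now < self.lease_expiry
```

With causal_reads_timeout = 4 and max_tolerable_clock_skew = 1, a keepalive
sent at t = 100 lets the standby accept causal reads until its own clock
reads 103. If the primary loses contact at t = 101, it holds causal reads
commits until t = 105, which is later than the expiry of any authorization
it could have sent before losing contact, even if the standby's clock is
behind by the full tolerated skew.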
> The alternative is that we simply send readers to any standby and allow
> the pool to work out separately whether the standby is still available,
> which mostly works, but it doesn't handle sporadic slow downs on particular
> standbys very well (if at all).
This proposal does handle sporadic slowdowns on standbys: it drops them
from the set of available standbys if they don't apply fast enough, all the
while maintaining the guarantee. Though it occurs to me that it probably
needs some kind of defence against too much flapping between available and
unavailable (maybe some kind of back off on the 'joining' phase that
standbys go through when they transition from unavailable to available in
the current patch, which I realize I haven't described yet -- but I don't
want to get bogged down in details, while we're talking about the 30,000
> I think we need to look at whether this does actually give us anything, or
> whether we are missing the underlying Heisenberg reality.