On Sun, Jun 25, 2017 at 2:36 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> I'm very happy that you are addressing this topic.
> I noticed you didn't put in links my earlier doubts about this
> specific scheme, though I can see doubts from myself and Heikki at
> least in the URLs. I maintain those doubts as to whether this is the
> right way forwards.
One technical problem was raised in the earlier thread by Ants Aasma
that I concede may be fatal to this design (see note below about
read-follows-read below), but I'm not sure. All the other discussion
seemed to be about trade-offs between writer-waits and reader-waits
schemes, both of which I still view as reasonable options for an end
user to have in the toolbox. Your initial reaction was:
> What we want is to isolate the wait only to people performing a write-read
> sequence, so I think it should be readers that wait.
I agree with you 100%, up to the comma. The difficulty is identifying
which transactions are part of a write-read sequence. An
application-managed LSN tracking system allows for waits to occur
strictly in reads that are part of a write-read sequence because the
application links them explicitly, and nobody is arguing that we
shouldn't support that for hard working expert users. But to support
applications that don't deal with LSNs (or some kind of "causality
tokens") explicitly I think we'll finish up having to make readers
wait for incidental later transactions too, not just the write that
your read is dependent on, as I'll show below. When you can't
identify write-read sequences perfectly, it comes down to a choice
between taxing writers or taxing readers, and I'm not sure that one
approach is universally better. Let me summarise my understanding of
I'm going to use this terminology:
synchronous replay = my proposal: ask the primary server to wait until
standbys have applied tx1, a bit like 9.6 synchronous_commit =
remote_apply, but with a system of leases for graceful failure.
causality tokens = the approach Heikki mentioned: provide a way for
tx1 to report its commit LSN to the client, then provide a way for a
client to wait for the LSN to be replayed before taking a snapshot for
tx1, tx2 = a pair of transactions with a causal dependency; we want
tx2 to see tx1 because tx1 caused tx2 in some sense so must be seen to
A poor man's causality token system can be cobbled together today with
pg_current_wal_lsn() and a polling loop that checks
pg_last_wal_replay_lsn(). It's a fairly obvious thing to want to do.
Several people including Heikki Linnakangas, Craig Ringer, Ants Aasma,
Ivan Kartyshov and probably many others have discussed better ways to
do that, and a patch for the wait-for-LSN piece appeared in a
recent commitfest. I reviewed Ivan's patch and voted -1 only
because it didn't work for higher isolation levels. If he continues
to develop that I will be happy to review and help get it into
committable shape, and if he doesn't I may try to develop it myself.
In short, I think this is a good tool to have in the toolbox and
PostgreSQL 11 should have it! But I don't think it necessarily
invalidates my synchronous replay proposal: they represent different
sets of trade-offs and might appeal to different users. Here's why:
To actually use a causality token system you either need a carefully
designed application that keeps track of causal dependencies and
tokens, in which case the developer works harder but can benefit from
from an asynchronous pipelining effect (by the time tx2 runs we hope
that tx1 has been applied, so neither transaction had to wait). Let's
call that "application-managed causality tokens". That's great for
those users -- let's make that possible -- but most users don't want
to write code like that. So I see approximately three choices for
transparent middleware (or support built into standbys), which I'll
name and describe as follows:
1. "Panoptic" middleware: Sees all queries so that it can observe
commit LSNs and inject wait directives into all following read-only
transactions. Since all queries now need to pass through a single
process, you have a new bottleneck, an extra network hop, and a single
point of failure so you'll probably want a failover system with
2. "Hybrid" middleware: The middleware (or standby itself) listens to
the replication stream so that it knows about commit LSNs (rather than
spying on commits). The primary server waits until all connected
middleware instances acknowledge commit LSNs before it releases
commits, and then the middleware inserts wait-for-LSN directives into
read-only transactions. Now there is no single point problem, but
writers are impacted too. I mention this design because I believe
this is conceptually similar to how Galera wsrep_sync_wait (AKA
wsrep_causal_reads) works. (I call this "hybrid" because it splits
the waiting between tx1 and tx2. Since it's synchronous, dealing with
failure gracefully is tricky, probably needing a lease system like SR.
I acknowledge that comparisons between our streaming replication and
Galera are slightly bogus because Galera is a synchronous multi-master
3. "Forward-only" middleware (insert better name): The middleware (or
standby itself) asks the primary server for the latest committed LSN
at the start of every transaction, and then tells the standby to wait
for that LSN to be applied.
There are probably some other schemes involving communicating
middleware instances, but I don't think they'll be better in ways that
matter -- am I wrong?
Here's a trade-off table:
SR AT PT HT FT
tx1 commit waits? yes no no yes no
tx2 snapshot waits? no yes yes yes yes
tx2 waits for incidental transactions? no no yes yes yes
tx2 has round-trip to primary? no no no no yes
can tx2's wait be pipelined? yes no* no* no*
SR = synchronous replay
AT = application-managed causality tokens
PT = panoptic middleware-managed causality tokens
HT = hybrid middleware-managed or standby-managed causality tokens
FT = forward-only middleware-managed causality tokens
*Note that only synchronous replay and application-managed causality
tokens track the actual causal dependency tx2->tx1. SR does it by
making tx1 wait for replay so that tx2 doesn't have to wait at all and
AT does it by making tx2 wait specifically for tx1 to be applied. PT,
HT and FT don't actually know anything about tx1, so they make every
read query wait until *all known transactions* are applied
("incidental transactions" above), throwing away the pipelining
benefits of causality token systems (hence "no*" above). I haven't
used it myself but I have heard that that is why read latency is a
problem on Galera with causal reads mode enabled: due to lack of
better information you have to wait for the replication system to
drain its current apply queue before every query is processed, even if
tx1 in your causally dependent transaction pair was already visible on
the current node.
So far I think that SR and AT are sweet spots. AT for people who are
prepared to juggle causality tokens in their applications and SR for
people who want to remain oblivious to all this stuff and who can
tolerate a reduction in single-client write TPS. I also think AT's
pipelining advantage over SR and SR's single-client TPS impact are
diminished if you also choose to enable syncrep for durability, which
isn't a crazy thing to want to do if you're doing anything important
with your data. The other models where all readers wait for
incidental transactions don't seem terribly attractive to me,
especially if the motivating premise of load balancing with read-only
replicas is (to steal a line) "ye [readers] are many, they are few".
One significant blow my proposal received in the last thread was a
comment from Ants about read-follows-read. What do you think? I
suspect the same problem applies to causality token based systems as
discussed so far (except perhaps FT, the slowest and probably least
acceptable to anyone). On the other hand, I think it's at least
possible to fix that problem with causality tokens. You'd have to
expose and capture the last-commit-LSN for every snapshot used in
every single read query, and wait for it at the moment ever following
read query takes a new snapshot. This would make AT even harder work
for developers, make PT even slower, and make HT unworkable (it only
knows about commits, not reads). I also suspect that a sizeable class
of applications cares a lot less about read-follows-read than
read-follows-write, but I could be wrong about that.
> This patch presumes we will load balance writes to a master and reads
> to a pool of standbys. How will we achieve that?
> 1. We decorate the application with additional info to indicate
> routing/write concerns.
> 2. We get middleware to do routing for us, e.g. pgpool style read/write
> The explicit premise of the patch is that neither of the above options
> are practical, so I'm unclear how this makes sense. Is there some use
> case that you have in mind that has not been fully described? If so,
> lets get it on the table.
I don't think that pgpool routing is impractical, just that it's not a
great place to put transparent causality token tracking for the
reasons I've explained above -- you'll introduce a ton of latency for
all readers because you can't tell which earlier transactions they
might be causally dependent on. I think it's also nice to be able to
support the in-process connection pooling and routing that many
application developers use to avoid extra hops, so it'd be nice to
avoid making pooling/routing/proxy servers strictly necessary.
> What I think we need is a joined up plan for load balancing, so that
> we can understand how it will work. i.e. explain the whole use case
> and how the solution works.
Here are some ways you could set a system up:
1. Use middleware like pgpool or pgbouncer-rr to route queries
automatically; this is probably limited to single-statement queries,
since multi-statement queries can't be judged by their first statement
alone. (Those types of systems could be taught to understand a
request for a connection with causal reads enabled, and look at the
current set of usable standbys by looking at the pg_stat_replication
2. Use the connection pooling inside your application server or
application framework/library: for example Hibernate, Django and
many other libraries offer ways to configure multiple database
connection pools and route queries appropriately at a fairly high
level. Such systems could probably be improved to handle 'synchronous
replay not available' errors by throwing away the connection and
retrying automatically on another connection, much as they do for
serialization failures and deadlocks.
3. Modify your application to deal with separate connection pools
directly wherever it runs database transactions.
Perhaps I'm not thinking big enough: I tried to come up with an
incremental improvement to PostgreSQL that would fix a problem that I
know people have with their current hot standby deployment. I
deliberately avoided proposing radical architectural projects such as
moving cluster management, discovery, proxying, pooling and routing
responsibilities into PostgreSQL. Perhaps those working on GTM type
systems which effectively present a seamless single system find this
whole discussion to be aiming too low and dealing with the wrong
> I'm especially uncomfortable with any approaches that treat all
> sessions as one pool. For me, a server should support multiple pools.
> Causality seems to be a property of a particular set of pools. e.g.
> PoolS1 supports causal reads against writes to PoolM1 but not PoolM2,
> yet PoolS2 does not provide causal reads against PoolM1 orPoolM2.
Interesting, but I don't immediately see any fundamental difficulty
for any of the designs discussed. For example, maybe tx1 should be
able to set synchronous_replay = <group name>, rather than just 'on',
to refer to a ground of standbys defined in some GUC.
Just by the way, while looking for references I found
PinningMasterSlaveRouter which provides a cute example of demand for
causal reads (however implemented) in the Django community:
It usually sends read-only transactions to standbys, but keeps your
web session temporarily pinned to the primary database to give you 15
seconds' worth of read-your-writes after each write transaction.
 https://github.com/yandex/django_replicated (for example; several
similar extensions exist)
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription: