On 05/30/2014 06:27 PM, Andres Freund wrote:
On 2014-05-30 17:59:23 +0300, Heikki Linnakangas wrote:
One thorny issue came up in discussions with other hackers on this in PGCon:
When a transaction is committed asynchronously, it becomes visible to other
backends before the commit WAL record is flushed. With CSN-based snapshots,
the order that transactions become visible is always based on the LSNs of
the WAL records. This is a problem when there is a mix of synchronous and
asynchronous commits:
If transaction A commits synchronously with commit LSN 1, and transaction B
commits asynchronously with commit LSN 2, B cannot become visible before A.
And we cannot acknowledge B as committed to the client until it's visible to
other transactions. That means that B will have to wait for A's commit
record to be flushed to disk before it can return, even though it was an
asynchronous commit.
I personally think that's annoying, but we can live with it. The most common
usage of synchronous_commit=off is to run a lot of transactions in that
mode, setting it in postgresql.conf. And it wouldn't completely defeat the
purpose of mixing synchronous and asynchronous commits either: an
asynchronous commit still only needs to wait for any already-logged
synchronous commits to be flushed to disk, not the commit record of the
asynchronous transaction itself.
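The waiting rule described above can be sketched as a small helper (all names here are hypothetical, not from the patch): since visibility order follows commit LSN, an asynchronous commit must wait until WAL is flushed past every already-logged synchronous commit record with a smaller LSN, but not past its own record.

```c
#include <stdint.h>

/* Hypothetical helper: given the commit-record LSNs of already-logged
 * synchronous commits, compute how far WAL must be flushed before an
 * asynchronous commit at async_lsn may become visible and be
 * acknowledged. Only synchronous commits *below* async_lsn matter;
 * the async commit's own record need not be flushed. */
static uint64_t
required_flush_lsn(const uint64_t *sync_lsns, int n, uint64_t async_lsn)
{
    uint64_t need = 0;

    for (int i = 0; i < n; i++)
        if (sync_lsns[i] < async_lsn && sync_lsns[i] > need)
            need = sync_lsns[i];

    return need;    /* 0 means: no earlier sync commit, return at once */
}
```

With the example from the text (sync commit A at LSN 1, async commit B at LSN 2), B only waits for a flush up to LSN 1, never for its own record at LSN 2.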
I have a hard time believing that users won't hate us for such a
regression. It's pretty common to mix both sorts of transactions and
this will - by my guesstimate - dramatically reduce throughput for the
asynchronous ones.
Yeah, it probably would. Not sure how many people would care.
For an asynchronous commit, we could store the current WAL flush
location as the commit LSN, instead of the location of the commit
record. That would break the property that LSN == commit order, but that
property is fundamentally incompatible with having async commits become
visible without flushing previous transactions. Or we could even make it
configurable, it would be fairly easy to support both behaviors.
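A minimal sketch of that alternative (hypothetical names, not the patch's API): an asynchronous commit records the current WAL flush location as its CSN instead of its commit record's LSN, so its CSN is already durable and it can become visible immediately.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch. A synchronous commit keeps its commit record's
 * LSN as its CSN, preserving CSN == commit order. An asynchronous
 * commit instead takes the current flush location, so it becomes
 * visible at once, at the cost of breaking the LSN == commit-order
 * property discussed above. */
static uint64_t
commit_csn(bool synchronous, uint64_t record_lsn, uint64_t flushed_lsn)
{
    return synchronous ? record_lsn : flushed_lsn;
}

/* With CSNs assigned this way, visibility is a plain comparison
 * against the snapshot's CSN (0 = not committed). */
static bool
xid_visible(uint64_t csn, uint64_t snapshot_csn)
{
    return csn != 0 && csn <= snapshot_csn;
}
```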
* Logical decoding is broken. I hacked on it enough that it looks roughly
sane and it compiles, but didn't spend more time to debug.
I think we can live with it not working for the first few
iterations. I'll look into it once the patch has stabilized a bit.
* I expanded pg_clog to 64 bits per XID, but people suggested keeping
pg_clog as is, with two bits per transaction, and adding a new SLRU for the
commit LSNs beside it. Probably will need to do something like that to avoid
bloating the clog.
It also influences how on-disk compatibility is dealt with. So: How are
you planning to deal with on-disk compatibility?
* Add some kind of backend-private caching of clog, to make it faster to
access. The visibility checks are now hitting the clog a lot more heavily
than before, as you need to check the clog even if the hint bits are set, if
the XID falls between xmin and xmax of the snapshot.
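The check being described might look like this (a sketch; the snapshot fields and clog accessor are hypothetical stand-ins): for XIDs inside the snapshot's xmin-xmax window, hint bits alone no longer settle visibility, because the commit CSN has to be fetched from the clog and compared against the snapshot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the clog: commit CSN per XID, 0 = in progress or
 * aborted. XID 3 is pre-marked committed with CSN 10 for illustration. */
static uint64_t fake_clog[8] = {0, 0, 0, 10};

static uint64_t
clog_commit_csn(uint32_t xid)
{
    return fake_clog[xid];
}

/* Sketch of a CSN-based visibility test. Below xmin, the "known
 * committed" hint bit suffices; at or above xmax the XID is invisible;
 * in between, the clog must be consulted even when the hint bit is
 * set, to compare the commit CSN with the snapshot's CSN. */
static bool
xid_visible_in_snapshot(uint32_t xid, bool hint_committed,
                        uint32_t xmin, uint32_t xmax,
                        uint64_t snapshot_csn)
{
    if (xid < xmin)
        return hint_committed;
    if (xid >= xmax)
        return false;
    /* The expensive path: hint bits don't help here. */
    uint64_t csn = clog_commit_csn(xid);
    return csn != 0 && csn <= snapshot_csn;
}
```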
That'll hurt a lot in concurrent scenarios :/. Have you measured how
'wide' xmax-xmin usually is?
That depends entirely on the workload. The worst case is a mix of a
long-running transaction and a lot of short transactions. It could grow
to millions of transactions or more in that case.
I wonder if we could just copy a range of
values from the clog when we start scanning....
I don't think that's practical, if the xmin-xmax gap is wide.
Perhaps we should take the bull by the horns and make clog faster to
look up. If we e.g. mmapped the clog file into backend-private address
space, we could avoid all the locking overhead of an SLRU. On platforms with
atomic 64-bit instructions, you could read the clog with just a memory
barrier. Even on other architectures, you'd only need a spinlock.
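One way that lock-free read could look with C11 atomics (a sketch; a plain array stands in for the mmap()ed clog region, and the names are made up):

```c
#include <stdatomic.h>
#include <stdint.h>

/* A plain array stands in for the clog file mapped into backend-private
 * address space with mmap(); the access pattern is the point here. */
static _Atomic uint64_t clog_region[16];    /* commit CSN per XID, 0 = none */

static void
clog_write(uint32_t xid, uint64_t csn)
{
    /* Release store: everything done before the commit (e.g. writing
     * the commit record) must be observable before the CSN is. */
    atomic_store_explicit(&clog_region[xid], csn, memory_order_release);
}

static uint64_t
clog_read(uint32_t xid)
{
    /* On platforms with native 64-bit atomics this is a plain load
     * plus a memory barrier, with no SLRU locking; where 64-bit
     * atomics are not lock-free, the C11 runtime falls back to a lock,
     * matching the spinlock case mentioned above. */
    return atomic_load_explicit(&clog_region[xid], memory_order_acquire);
}
```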
Sent via pgsql-hackers mailing list