Hi hackers, I was pinged off-list by a fellow -hackers denizen interested in the synchronous replay feature and wanting a rebased patch to test. Here it goes, just in time for a Commitfest. Please skip to the bottom of this message for testing notes.
In previous threads[1][2][3] I called this feature proposal "causal reads". That was a terrible name, borrowed from MySQL. While it is probably a useful term of art, people kept reading it as "casual", which it ain't, and more importantly this patch is only one way to achieve read-follows-write causal consistency. Several other ways have been proposed or exist in forks (user-managed wait-for-LSN, global transaction manager, ...).

OVERVIEW

For writers, it works a bit like RAID mirroring: when you commit a write transaction, it waits until the data has become visible on all elements of the array, and if an array element is not responding fast enough it is kicked out of the array.

For readers, it's a little different because you're connected directly to the array elements (rather than going through a central controller), so it uses a system of leases that allow a read transaction to know instantly whether it is running on an element that is currently in the array and can therefore service synchronous_replay transactions, or whether it should raise an error telling you to go and ask some other element. This is a design choice favouring read-mostly workloads at the expense of write transactions.

Hot standbys' whole raison d'être is to move *some* read-only workloads off the primary server. This proposal is for users who are prepared to trade increased primary commit latency for a guarantee about visibility on the standbys, so that *all* read-only work could be moved to hot standbys.

The guarantee is: if two transactions tx1 and tx2 are run with synchronous_replay set to on and tx1 reports successful commit before tx2 begins, then tx2 is guaranteed either to see tx1 or to raise a new error, 40P02, if it is run on a hot standby. I have joked that that error means "snapshot too young". You could handle it the same way you handle deadlocks and serialization failures: by retrying, except in this case you might want to avoid that node for a while (see the client-side sketch near the end of this message).

Note that this feature is concerned with transaction visibility. It is not concerned with transaction durability. It will happily kick all of your misbehaving or slow standbys out of the array so that you fall back to single-node commit durability. You can express your durability requirement (ie "I must have N copies of the data on disk before I tell any external party about a transaction") separately, by configuring regular synchronous replication alongside this feature. I suspect that this feature would be most popular with people who are already using regular synchronous replication, though, because they already tolerate higher commit latency.

STATUS

Here's a quick summary of the status of this proposal as I see it:

* Simon Riggs, as the committer most concerned with the areas this proposal touches -- namely streaming replication and specifically syncrep -- has so far not appeared to be convinced of the value of this approach, and has expressed a preference for pursuing client-side or middleware-tracked LSN tokens exclusively. I am perceptive enough to see that failing to sell the idea to Simon is probably fatal to the proposal. The main task, therefore, is to show convincingly that there is a real use case for this high-level design and its set of trade-offs, and that it justifies its maintenance burden.
* I have tried to show that there are already many users who route their read-only queries to hot standby databases (not just "reporting queries"), and that there are libraries and tools to help people do that using heuristics like "logged-in users need fresh data, so primary only" or "this session has written in the past N minutes, so primary only". This proposal would give those users a way to do something based on a guarantee instead of such flimsy heuristics. I have tried to show that the libraries used by Python, Ruby, Java etc to achieve that sort of load balancing should easily be able to handle finding read-only nodes, routing read-only queries and dealing with the new error. I acknowledge that such libraries could also be used to provide transparent read-my-writes support by tracking LSNs and injecting wait-for-LSN directives under the alternative proposals, but that is weaker than a global reads-follow-writes guarantee, and the difference can matter.

* I have argued that token-based systems are in fact rather complicated[4] and by no means a panacea. As usual, there are a whole bunch of trade-offs. I suspect that this proposal AND fully user-managed causality tokens (no middleware) are both valuable sweet spots for a non-GTM system.

* Ants Aasma pointed out that this proposal doesn't provide a read-follows-read guarantee. He is right, and I'm not sure to what extent that is a problem, but I also think token-based systems can probably only solve it at fairly high cost.

* Dmitry Dolgov reported a bug causing the replication protocol to become corrupted on some OSs but not others[5]; it could be uninitialised data, a size/padding/layout thinko or some other stupid problem. (Gee, it would be nice if the wire-protocol writing and reading code were in reusable functions instead of open-coded in multiple places... the bug could be due to that.) Unfortunately I haven't managed to track it down yet, and I haven't had time to get back to this in time for the Commitfest due to other work. Given that a reviewer has expressed interest in testing this, which might result in that problem being figured out, I figured I might as well post the rebased patch anyway, and I will also have another look soon.

* As Andres Freund pointed out, this currently lacks tests. It should be fairly easy to add TAP tests to exercise this code, in the style of the existing tests for replication.

TESTING NOTES

Set up some hot standbys, put synchronous_replay_max_lag = 2s in the primary's postgresql.conf, then set synchronous_replay = on in every postgresql.conf, or at least in every session that you want to test with. Then generate various write workloads and observe the primary server's log as leases are granted and revoked, or check the status in pg_stat_replication's replay_lag and sync_replay columns (see the monitoring sketch below). Verify that you can't successfully run synchronous_replay = on transactions on standbys that don't currently have a lease, and that you can't trick it by cutting your network cables with scissors, killing random processes etc. You might want to verify my claims about clock drift and synchronous_replay_lease_time, either mathematically or experimentally.
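To make the error handling concrete, here is a rough, untested Python/psycopg2 sketch of what a client or pooling library might do: try a read-only query with synchronous_replay = on on one standby, and fall back to the next candidate when it gets SQLSTATE 40P02. Only the GUC name and the error code come from the patch; the DSNs, the table and the helper function are invented for illustration.

import psycopg2

# Hypothetical list of hot standbys to try, in order of preference.
STANDBY_DSNS = [
    "host=standby1 dbname=app",
    "host=standby2 dbname=app",
]

def run_synchronous_replay_query(sql, params=None):
    """Run a read-only query with synchronous_replay = on, skipping
    standbys that raise 40P02 because they currently hold no lease."""
    last_error = None
    for dsn in STANDBY_DSNS:
        conn = psycopg2.connect(dsn)
        try:
            with conn:  # one transaction; commit/rollback on exit
                with conn.cursor() as cur:
                    cur.execute("SET synchronous_replay = on")
                    cur.execute(sql, params)
                    return cur.fetchall()
        except psycopg2.Error as e:
            if e.pgcode == "40P02":
                # "Snapshot too young": this node is not (or no longer)
                # in the array; a real client would probably avoid it
                # for a while before retrying it.
                last_error = e
                continue
            raise
        finally:
            conn.close()
    if last_error is not None:
        raise last_error
    raise RuntimeError("no standby DSNs configured")

if __name__ == "__main__":
    print(run_synchronous_replay_query("SELECT count(*) FROM my_table"))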
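And here is an equally rough sketch of the kind of monitoring the testing notes suggest: poll pg_stat_replication on the primary and print each standby's replay_lag alongside the sync_replay column added by the patch, so you can watch leases being granted and revoked while you vary the write workload. The DSN and the one-second polling interval are placeholders.

import time

import psycopg2

PRIMARY_DSN = "host=primary dbname=postgres"  # placeholder connection string

def watch_leases(interval=1.0):
    conn = psycopg2.connect(PRIMARY_DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        while True:
            # replay_lag is an existing column; sync_replay is added by the patch.
            cur.execute(
                "SELECT application_name, replay_lag, sync_replay "
                "FROM pg_stat_replication"
            )
            for name, lag, state in cur.fetchall():
                print(f"{name}: replay_lag={lag} sync_replay={state}")
            time.sleep(interval)

if __name__ == "__main__":
    watch_leases()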
Thanks for reading!

[1] https://www.postgresql.org/message-id/flat/CAEepm%3D0n_OxB2_pNntXND6aD85v5PvADeUY8eZjv9CBLk%3DzNXA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/flat/CAEepm%3D1iiEzCVLD%3DRoBgtZSyEY1CR-Et7fRc9prCZ9MuTz3pWg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/CA%2BCSw_tz0q%2BFQsqh7Zx7xxF99Jm98VaAWGdEP592e7a%2BzkD_Mw%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CAEepm%3D0W9GmX5uSJMRXkpNEdNpc09a_OMt18XFhf8527EuGGUQ%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAEepm%3D352uctNiFoN84UN4gtunbeTK-PBLouVe8i_b8ZPcJQFQ%40mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
0001-Synchronous-replay-mode-for-avoiding-stale-reads--v5.patch