Re: [HACKERS] make async slave to wait for lsn to be replayed

Thomas Munro Sun, 12 Mar 2017 20:37:34 -0700

On Fri, Mar 10, 2017 at 1:49 AM, Ivan Kartyshov
<i.kartys...@postgrespro.ru> wrote:
> Here I attached rebased patch waitlsn_10dev_v3 (core feature)
> I will leave the choice of implementation (core/contrib) to the discretion
> of the community.
>
> Will be glad to hear your suggestion about syntax, code and patch.


Hi Ivan,

Here is some feedback based on a first read-through of the v4 patch.
I haven't tested it yet.

First, I'll restate my view of the syntax-vs-function question:  I
think an fmgr function is the wrong place to do this, because it
doesn't work for our 2 higher isolation levels as mentioned.  Most
people probably use READ COMMITTED most of the time so the extension
version you've proposed is certainly useful for many people and I like
it, but I will vote against inclusion in core of any feature that
doesn't consider all three of our isolation levels, especially if
there is no way to extend support to other levels later.  I don't want
PostgreSQL to keep adding features that eventually force everyone to
use READ COMMITTED because they want to use feature X, Y or Z.

Maybe someone can think of a clever way for an extension to insert a
wait for a user-supplied LSN *before* acquiring a snapshot so it can
work for the higher levels, or maybe the hooks should go into core
PostgreSQL so that the extension can exist as an external project not
requiring a patched PostgreSQL installation, or maybe this should be
done with new core syntax that extends transaction commands.  Do other
people have views on this?

+ * Portions Copyright (c) 2012-2017, PostgresPro Global Development Group

This should say PostgreSQL.

+wl_lsn_updated_hook(void)
+{
+    uint i;
+    /*
+     * After update lastReplayedEndRecPtr set Latches in SHMEM array
+     */
+    if (counter_waitlsn % count_waitlsn == 0
+        || 
TimestampDifferenceExceeds(time_waitlsn,GetCurrentTimestamp(),interval_waitlsn))
+    {

Doesn't this mean that if you are waiting for LSN 1234, and the
primary sends that LSN and then doesn't send anything for another
hour, a standby waiting in pg_waitlsn is quite likely to skip that
update (either because of count_waitlsn or interval_waitlsn), and then
finish up waiting for a very long time?

I'm not sure if this is a good idea, but it's an idea:  You could keep
your update skipping logic, but make sure you always catch the
important case where recovery hits the end of WAL, by invoking your
callback from WaitForWALToBecomeAvailable.  It could have a boolean
parameter that means 'don't skip this one!'.  In other words, it's OK
to skip updates, but not if you know there is no more WAL available to
apply (since you have no idea how long it will be for more to arrive).

Calling GetCurrentTimestamp() at high frequency (after every WAL
record is replayed) may be a bad idea.  It's a slow system call on
some operating systems.  Can we use an existing timestamp for free,
like recoveryLastXTime?  Remembering that the motivation for using
this feature is to wait for *whole transactions* to be replayed and
become visible, there is no sensible reason to wait for random WAL
positions inside a transaction, so if you used that then you would
effectively skip all non-COMMIT records and also skip some COMMIT
records that are coming down the pipe too fast.

+static void
+wl_own_latch(void)
+{
+    SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+    OwnLatch(&state->l_arr[MyBackendId].latch);
+    is_latch_owned = true;
+
+    if (state->backend_maxid < MyBackendId)
+        state->backend_maxid = MyBackendId;
+
+    state->l_arr[MyBackendId].pid = MyProcPid;
+    SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

What is the point of using extra latches for this?  Why not just use
the standard latch?  Syncrep and many other things do that.  I'm not
actually sure if there is ever a reason to create more latches in
regular backends.  SIGUSR1 will be delivered and set the main latch
anyway.

There are cases of specialised latches in the system, like the wal
receiver latch, and I'm reliably informed that that solves problems
like delivering a wakeup message without having to know which backend
is currently the wal receiver (a problem we might consider solving
today with a condition variable?)  I don't think anything like that
applies here.

+        for (i = 0; i <= state->backend_maxid; i++)
+        {
+            SpinLockAcquire(&state->l_arr[i].slock);
+            if (state->l_arr[i].pid != 0)
+                SetLatch(&state->l_arr[i].latch);
+            SpinLockRelease(&state->l_arr[i].slock);
+        }

Once we get through the update-skipping logic above, we hit this loop
which acquires  spinlocks for every backend one after another and sets
the latches of every backend, no matter whether they are waiting for
the LSN that has been applied.  Assuming we go with this
scan-the-whole-array approach, why not include the LSN waited for in
the array slots, so that we can avoid disturbing processes that are
waiting for a later LSN?

Could you talk a bit about the trade-off between this approach and a
queue based approach?  I would like to understand whether this really
is the best way to do it.

One way to use a queue would be
ConditionVariableBroadcast(&state->lsn_moved), and then waiters would
use a condition variable wait loop.  That would make your code much
simpler (you wouldn't even need your array of spinlocks + latches) but
would still have the problem of processes waking up even though
they're actually waiting for a later LSN.  Another choice would be to
create a custom wait list which actually holds the LSNs waited for in
sorted order, so that we wake up exactly the right processes, cheaply,
or in arbitrary order which makes insertion cheaper but search more
expensive.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] make async slave to wait for lsn to be replayed

Reply via email to