Re: [HACKERS] make async slave to wait for lsn to be replayed

Thomas Munro Wed, 14 Sep 2016 19:42:07 -0700

On Thu, Sep 1, 2016 at 2:16 AM, Ivan Kartyshov
<i.kartys...@postgrespro.ru> wrote:
> Hi hackers,
>
> Few days earlier I've finished my work on WAITLSN statement utility, so I’d
> like to share it.
> [...]
> Your feedback is welcome!
>
> [waitlsn_10dev.patch]


Hi Ivan,

Thanks for working on this.  Here are some general thoughts on the
feature, and an initial review.

+1 for this feature.  Explicitly waiting for a given commit to be
applied is one of several approaches to achieve "causal consistency"
for reads on replica nodes, and I think it will be very useful if
combined with a convenient way to get the values to wait for when you
run COMMIT.  This could be used either by applications directly, or by
middleware that somehow keeps track of dependencies between
transactions and inserts waits.

I liked the way Heikki Linnakangas imagined this feature[1]:

  BEGIN WAIT FOR COMMIT 1234 TO BE VISIBLE;

... or perhaps it could be spelled like this:

  BEGIN [isolation stuff] WAIT FOR COMMIT TOKEN <xxx> TIMEOUT <timeout>;

That allows waiting only at the start of a transaction, whereas your
idea of making a utility command would allow a single READ COMMITTED
transaction to wait multiple times for transactions it has heard about
through side channels, which may be useful.  Perhaps we could support
the same syntax in a stand alone statement inside a transaction OR as
part of a BEGIN ... statement.  Being able to do it as part of BEGIN
means that you can use this feature for single-snapshot transactions,
ie REPEATABLE READ and SERIALIZABLE (of course you can't use
SERIALIZABLE on hot standbys yet but that'll be fixed one day).
Otherwise you'd be waiting for the LSN in the middle of your
transaction but not be able to see the result because you don't take a
new snapshot.  Or maybe it's enough to use a standalone WAIT ...
statement inside a REPEATABLE READ or SERIALIZABLE transaction as long
as it's the first statement, and should be an error to do so any time
later?

I think working in terms of LSNs or XIDs explicitly is not a good
idea: encouraging clients to think in terms of anything other than
opaque 'commit tokens' seems like a bad idea because it limits future
changes.  For example, there is on-going discussion about introducing
CSNs (commit sequence numbers), and there are some related concepts
lurking in the SSI code; maybe we'd want to use those one day.  Do you
think it would make sense to have a concept of a commit token that is
a non-analysable string as far as clients are concerned, so that
clients are not encouraged to do anything at all with them except use
them in a WAIT FOR COMMIT TOKEN <xxx> statement?

INITIAL FEEDBACK ON THE PATCH

I didn't get as far as testing or detailed review because it has some
obvious bugs and compiler warnings which I figured we should talk
about first, and I also have some higher level questions about the
design.

gram.y:12882:15: error: assignment makes pointer from integer without
a cast [-Werror=int-conversion]
      n->delay = $3;

It looks like struct WaitLSNStmt accidentally has 'delay' as a pointer
to int.  Perhaps you want an int?  Maybe it would be useful to include
the units (millisecond, ms) in the name?

waitlsn.c: In function 'WLDisownLatch':
waitlsn.c:82:2: error: suggest parentheses around assignment used as
truth value [-Werror=parentheses]
  if (MyBackendId = state->backend_maxid)
  ^~

Pretty sure you want == here.

waitlsn.c: In function 'WaitLSNUtility':
waitlsn.c:153:17: error: initialization makes integer from pointer
without a cast [-Werror=int-conversion]
  int   tdelay = delay;
                 ^~~~~

Another place where I think you wanted an int but used a pointer to
int?  To fix that warning you need tdelay = *delay, but I think delay
should really not be taken by pointer at all.

@@ -6922,6 +6923,11 @@ StartupXLOG(void)
+ /*
+ * After update lastReplayedEndRecPtr set Latches in SHMEM array
+ */
+ WaitLSNSetLatch();
+

I think you should try to do this only after commit records are
replayed, not after every record.  Only commit records can make
transactions visible, and the raison d'être for this feature is to let
users wait for transactions they know about to become visible.  You
probably can't do it directly in xact_redo_commit though, because at
that point XLogCtl->lastReplayedEndRecPtr hasn't been updated yet so a
backend that wakes up might not see that it has advanced and go back
to sleep.  It is updated in the StartupXLOG loop after the redo
function runs.  That is the reason why WalRcvForceReply() is called
from there rather than in xact_redo_commit, to implement remote_apply
for replication.  Perhaps you need something similar?

+ tdelay -= (GetCurrentTimestamp() - timer);

You can't do arithmetic with TimestampTz like this.  Depending on
configure option --disable-integer-datetimes (which controls macro
HAVE_INT64_TIMESTAMP), it may be a floating point number of seconds
since the epoch, or an integer number of microseconds since the epoch.
It looks like maybe the above code assumes it works in milliseconds,
since you're using that to adjust your delay which is in milliseconds?

I would try to figure out how to express the logic you want with
TimestampTzPlusMilliseconds and TimestampDifference.  I'd probably do
something like compute the absolute timeout time with
TimestampTzPlusMilliseconds(GetCurrentTimestamp(), delay) at the start
and then compute the remaining delay each time through the latch wait
loop with TimestampDifference(GetCurrentTimestamp(), timeout, ...),
though that requires converting (seconds, microseconds) to
millisecond.

+void
+WaitLSNSetLatch(void)
+{
+ uint i;
+ for (i = 1; i < (state->backend_maxid+1); i++)
+ {
+ SpinLockAcquire(&state->l_arr[i].slock);
+ if (state->l_arr[i].pid != 0)
+ SetLatch(&state->l_arr[i].latch);
+ SpinLockRelease(&state->l_arr[i].slock);
+ }
+}

So your approach here is to let regular backends each have their own
slot indexed by backend ID, which seems good because it means that
they don't have to contend for a lock, but it's bad because it means
that the recovery process has to spin through the array looking for
backends to wake up every time it advances, and wake them all up no
matter whether they are interested in the current LSN or not.  That
means that they may get woken up many times before they see the value
they're waiting for.

Did you also consider a design where there would be a wait list/queue,
and the recovery process would wake up only those backends that are
waiting for LSNs <= the current replayed location?  That would make
the work for the recovery process cheaper (it just has to pop waiters
from one end of a list sorted by the LSN they're waiting for), and let
the waiting backends sleep without so many spurious wake-ups, but it
would create potential for contention between backends that are
calling WAITLSN at the same time because they all need to add
themselves to that queue which would involve some kind of mutex.  I
don't know if that would be better or not, but it's probably the first
way that I would have tried to do this.  See syncrep.c which does
something similar.

+static void
+WLOwnLatch(void)
+{
+ SpinLockAcquire(&state->l_arr[MyBackendId].slock);
+ OwnLatch(&state->l_arr[MyBackendId].latch);
+ is_latch_owned = true;
+ if (MyBackendId > state->backend_maxid)
+ state->backend_maxid += 1;
+ state->l_arr[MyBackendId].pid = MyProcPid;
+ SpinLockRelease(&state->l_arr[MyBackendId].slock);
+}

I'm a bit confused about state->backend_maxid.  It looks like you are
using that to limit the range of slots that WaitLSNSetLatch has to
scan.  Isn't it supposed to contain the highest MyBackendId that has
ever been seen?  You appear to be incrementing backend_maxid by one,
instead of recording the index of the highest slot in use, but then
WaitLSNSetLatch is using it to restrict the range of indexes.  Then
there is the question of the synchronisation of access to
backend_maxid.  You hold a spinlock in one arbitrary slot, but that
doesn't seem sufficient: another backend may also read it, compute a
new value and then write it, while holding a different spin lock.  Or
am I missing something?

+ if (delay > 0)
+ latch_events = WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH;
+ else
+ latch_events = WL_LATCH_SET | WL_POSTMASTER_DEATH;

Here you are using delay <= 0 as 'wait forever'.  I wonder if it would
be useful to have two different special values: one meaning 'wait
forever', and another meaning 'don't wait at all: if it's not applied
yet, then timeout immediately'.  In any case I'd consider using names
for special wait times and using those for clarity:
WAITLSN_INFINITE_WAIT, WAITLSN_NO_WAIT.

Later I'll have feedback on the error messages, documentation and
comments but let's talk just about the design and code for now.

I hope this is helpful!

[1] https://www.postgresql.org/message-id/5642FF8F.4080803%40iki.fi

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] make async slave to wait for lsn to be replayed

Reply via email to