Hi,
On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > Well, wal_writer_delay doesn't work if walwriter is in sleep mode, and
> > this afaics would allow wal writer to go into sleep mode with half a
> > page filled, and it'd not be woken up again until the page is filled.
> > No?
> >
>
> If it is taking the big sleep, then we always wake it up, since
> acd4c7d58baf09f.
>
> I didn't change that part. I only changed what we do if it not hibernating.
Right, I hadn't read enough of the context.
> It looks like only limited consolidation was happening, the number of kills
> invoked was more than half of the number of SetLatch. I think the reason
> is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> suspends the calling process and gives the signalled process a chance to
> run in that time slice. The Wal Writer gets woken up, sees that it only
> has a partial page to write and decides not to do anything, but resets the
> latch. It does this fast enough that the subscription worker hasn't
> migrated CPUs yet.
I think part of the problem here is that latches signal the owning
process even if the owning process isn't actually sleeping - that's
obviously quite pointless. In cases where walwriter is busy, that
actually causes a lot of superflous interrupted syscalls, because it
actually never ends up waiting on the latch. There's also a lot of
superflous context switches. I think we can avoid doing so quite
easily, as e.g. with the attached patch. Could you check how much that
helps your benchmark?
> The first change made the WALWriter wake up and do a write and flush
> whenever an async commit occurred and there was a completed WAL page to be
> written. This way the hint bits could be set while the heap page was still
> hot, because they couldn't get set until the WAL covering the hinted-at
> transaction commit is flushed.
>
> The second change said we can set hint bits before the WAL covering the
> hinted-at transaction is flushed, as long the page LSN is newer than that
> (so the wal covering the hinted-at transaction commit must be flushed
> before the page containing the hint bit can be written).
>
> Then the third change makes the wal writer only write the full pages most
> of the times it is woken up, not flush them. It only flushes them once
> every wal_writer_delay or wal_writer_flush_after, regardless of how often
> it is woken up.
>
> It seems like the third change rendered the first one useless.
I don't think so. Isn't the walwriter writing out the contents of the
WAL is quite important because it makes room in wal_buffers for new WAL?
I suspect we actually should wake up wal-writer *more* aggressively, to
offload wal fsyncs from individual backends, even when they're not
committing. Right now we fsync whenever a segment is finished - we
really don't want to do that in backends that could do other useful
work. I suspect it'd be a good idea to remove that logic from
XLogWrite() and replace it with
if (ProcGlobal->walwriterLatch)
SetLatch(ProcGlobal->walwriterLatch);
> Wouldn't it
> better, and much simpler, just to have reverted the first change once the
> second one was done?
I think both can actually happen independently, no? It's pretty common
for the page lsn to be *older* than the corresponding commit. In fact
you'll always hit that case unless there's also concurrent writes also
touching said page.
> If that were done, would the third change still be
> needed? (It did seem to add some other features as well, so I'm going to
> assume we still want those...).
It's still useful, yes. It avoids flushing the WAL too often
(page-by-page when there's a lot of writes), which can eat up a lot of
IOPS on fast storage.
> But now the first change is even worse than useless, it is positively
> harmful. The only thing to stop it from waking the WALWriter for every
> async commit is this line:
>
> /* if we have already flushed that far, we're done */
> if (WriteRqstPtr <= LogwrtResult.Flush)
> return;
>
> Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> becomes false due to us waking walwriter, it only becomes false once the
> timeout expires (which it would have done anyway with no help from us), or
> once wal_writer_flush_after is exceeded. So now every async commit is
> waking the walwriter. Most of those wake up do nothing (there is not a
> completely new patch to write), some write one completed page but don't
> flush anything, and very few do a flush, and that one would have been done
> anyway.
s/completely new patch/completely new page/?
In my opinion we actually *do* want to write (but not flush!) out such
pages, so I'm not sure I agree with that logic. Have to think about
this some more...
Greetings,
Andres Freund
>From 50a3e62b16b752837f5a388e3259fee2bca6c2ab Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Sat, 24 Jun 2017 16:46:11 -0700
Subject: [PATCH 1/3] Don't signal latch owners if owner not waiting on latch.
Doing so can generate a lot of useless syscalls and context switches
in cases where the owner of the latch is busy doing actual work.
Discussion: https://postgr.es/m/CAMkU=1z6_k4me12vnirgv5qnu58+czqdnxd+pb5bzxnvovv...@mail.gmail.com
---
src/backend/storage/ipc/latch.c | 62 ++++++++++++++++++++++++++++++++++-------
src/include/storage/latch.h | 1 +
2 files changed, 53 insertions(+), 10 deletions(-)
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index 07b1364de8..c1a1a76d0a 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -221,6 +221,7 @@ InitLatch(volatile Latch *latch)
{
latch->is_set = false;
latch->owner_pid = MyProcPid;
+ latch->owner_waiting = false;
latch->is_shared = false;
#ifndef WIN32
@@ -268,6 +269,7 @@ InitSharedLatch(volatile Latch *latch)
latch->is_set = false;
latch->owner_pid = 0;
+ latch->owner_waiting = false;
latch->is_shared = true;
}
@@ -311,6 +313,7 @@ DisownLatch(volatile Latch *latch)
Assert(latch->owner_pid == MyProcPid);
latch->owner_pid = 0;
+ latch->owner_waiting = false; /* paranoia */
}
/*
@@ -466,7 +469,17 @@ SetLatch(volatile Latch *latch)
sendSelfPipeByte();
}
else
- kill(owner_pid, SIGUSR1);
+ {
+ /*
+ * If the owner of the latch is currently waiting on it, send signal
+ * to wake up that process. The read-barrier is necessary so we see
+ * an up-to-date value, and it pairs with the memory barrier in the
+ * path setting owner_waiting in WaitEventSetWait.
+ */
+ pg_read_barrier();
+ if (latch->owner_waiting)
+ kill(owner_pid, SIGUSR1);
+ }
#else
/*
@@ -954,6 +967,12 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* immediately, avoid blocking again. We don't attempt to report any
* other events that might also be satisfied.
*
+ * To avoid other processes signalling us if we're not waiting,
+ * wasting context switches, we first check whether the latch is
+ * already set, and only enable signalling from other processes after
+ * that. To avoid a race, we've to recheck whether the latch is set
+ * after asking to be woken up.
+ *
* If someone sets the latch between this and the
* WaitEventSetWaitBlock() below, the setter will write a byte to the
* pipe (or signal us and the signal handler will do that), and the
@@ -976,17 +995,37 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
* ordering, so that we cannot miss seeing is_set if a notification
* has already been queued.
*/
- if (set->latch && set->latch->is_set)
+ if (set->latch)
{
- occurred_events->fd = PGINVALID_SOCKET;
- occurred_events->pos = set->latch_pos;
- occurred_events->user_data =
- set->events[set->latch_pos].user_data;
- occurred_events->events = WL_LATCH_SET;
- occurred_events++;
- returned_events++;
+ bool latch_set = false;
+
+ if (set->latch->is_set)
+ {
+ latch_set = true;
+ }
+ else
+ {
+ set->latch->owner_waiting = true;
+ pg_memory_barrier();
+ if (set->latch->is_set)
+ {
+ latch_set = true;
+ }
+ }
+
+ if (latch_set)
+ {
+ occurred_events->fd = PGINVALID_SOCKET;
+ occurred_events->pos = set->latch_pos;
+ occurred_events->user_data =
+ set->events[set->latch_pos].user_data;
+ occurred_events->events = WL_LATCH_SET;
+ occurred_events++;
+ returned_events++;
+
+ break;
+ }
- break;
}
/*
@@ -1016,6 +1055,9 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
waiting = false;
#endif
+ if (set->latch)
+ set->latch->owner_waiting = false;
+
pgstat_report_wait_end();
return returned_events;
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index 73abfafec5..442344d914 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -112,6 +112,7 @@ typedef struct Latch
sig_atomic_t is_set;
bool is_shared;
int owner_pid;
+ int owner_waiting; /* int, so all platforms can set atomically */
#ifdef WIN32
HANDLE event;
#endif
--
2.13.1.392.g8d1b10321b.dirty
--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers