On Sat, Jun 24, 2017 at 5:09 PM, Andres Freund <and...@anarazel.de> wrote:

> Hi,
> On 2017-06-15 15:06:43 -0700, Jeff Janes wrote:
> > It looks like only limited consolidation was happening, the number of
> kills
> > invoked was more than half of the number of SetLatch.  I think the reason
> > is what when kill(owner_pid, SIGUSR1) is called, the kernel immediately
> > suspends the calling process and gives the signalled process a chance to
> > run in that time slice.  The Wal Writer gets woken up, sees that it only
> > has a partial page to write and decides not to do anything, but resets
> the
> > latch.  It does this fast enough that the subscription worker hasn't
> > migrated CPUs yet.
> I think part of the problem here is that latches signal the owning
> process even if the owning process isn't actually sleeping - that's
> obviously quite pointless.  In cases where walwriter is busy, that
> actually causes a lot of superflous interrupted syscalls, because it
> actually never ends up waiting on the latch. There's also a lot of
> superflous context switches.  I think we can avoid doing so quite
> easily, as e.g. with the attached patch.  Could you check how much that
> helps your benchmark?

That patch doesn't make any improvement to the pgbench --unlogged-tables
benchmark.  I expected that, because this benchmark exposes the opposite
problem.  The wal writer isn't busy, it is constantly being woken up but
then immediately going back to sleep.

> > The first change made the WALWriter wake up and do a write and flush
> > whenever an async commit occurred and there was a completed WAL page to
> be
> > written.  This way the hint bits could be set while the heap page was
> still
> > hot, because they couldn't get set until the WAL covering the hinted-at
> > transaction commit is flushed.
> >
> > The second change said we can set hint bits before the WAL covering the
> > hinted-at transaction is flushed, as long the page LSN is newer than that
> > (so the wal covering the hinted-at transaction commit must be flushed
> > before the page containing the hint bit can be written).
> >
> > Then the third change makes the wal writer only write the full pages most
> > of the times it is woken up, not flush them.  It only flushes them once
> > every wal_writer_delay or wal_writer_flush_after, regardless of how often
> > it is woken up.
> >
> > It seems like the third change rendered the first one useless.
> I don't think so. Isn't the walwriter writing out the contents of the
> WAL is quite important because it makes room in wal_buffers for new WAL?

If we wanted to do that, I think the asynchronous committer should just
issue the write directly (unless wal_sync_method is open_datasync
or open_sync).  Writes are usually extremely fast and it not worth trying
to offload them to a background process, as your half of the signalling
takes longer than just doing the write yourself.  When that is not true and
writes start blocking due to extreme io congestion, them everything is
going to start blocking anyway (wal_buffers head will run into the tail,
and the tail can't advance because the writes are blocking) so while
offloading writes to a background process is then good in theory, it isn't
very effective.  If everyone is stuck, it doesn't matter which process is
directly stuck and which ones are indirectly stuck.

> I suspect we actually should wake up wal-writer *more* aggressively, to
> offload wal fsyncs from individual backends, even when they're not
> committing.  Right now we fsync whenever a segment is finished - we
> really don't want to do that in backends that could do other useful
> work.  I suspect it'd be a good idea to remove that logic from
> XLogWrite() and replace it with
>         if (ProcGlobal->walwriterLatch)
>                 SetLatch(ProcGlobal->walwriterLatch);
My understanding is that you can't start on a new file until the old file
is completely synced, because the book keeping can currently only handle
one file at a time.  So if you signal the wal writer to do the sync, you
would just end up immediately blocking on it to finish.  Getting around
that would require substantially more changes; we would need to implement a
queue of full log files not yet synced so that we could move on without
waiting.  If we had such a queue, then sure, this would be a good idea to
just wake the wal writer. But currently, it only matters for large
transactions (\copy, insert ... select, bulk updates).  With small
transactions, you can't fill up log files quickly enough for it to make
much of a difference.  So I think this is something to do, but it is
independent from what I am currently going on about.

> > Wouldn't it
> > better, and much simpler, just to have reverted the first change once the
> > second one was done?
> I think both can actually happen independently, no? It's pretty common
> for the page lsn to be *older* than the corresponding commit.  In fact
> you'll always hit that case unless there's also concurrent writes also
> touching said page.

That is true, but I don't see any residual regression when going from
pre-db76b1efbbab2441428 to 7975c5e0a992ae9a4-with-my-patch applied.  So I
think the hint bit issue only shows up when hitting the same page
aggressively, in which case the LSN does get advanced quickly enough that
db76b1efbba solves it.  Perhaps a different test case could show that
issue.  I also didn't see any improvement from the original 4de82f7d7c50a81e,
so maybe 8 CPUs just isn't enough to detect the problem with setting
hint-buts with permanent tables with synchronous_commit=false.

> > If that were done, would the third change still be
> > needed?  (It did seem to add some other features as well, so I'm going to
> > assume we still want those...).
> It's still useful, yes.  It avoids flushing the WAL too often
> (page-by-page when there's a lot of writes), which can eat up a lot of
> IOPS on fast storage.

But the only reason it would do that is because we keep waking it up.  If
we stop waking it up, it would stop flushing each page, even if it still
had the logic to flush for each page there.

> > But now the first change is even worse than useless, it is positively
> > harmful.  The only thing to stop it from waking the WALWriter for every
> > async commit is this line:
> >
> >         /* if we have already flushed that far, we're done */
> >         if (WriteRqstPtr <= LogwrtResult.Flush)
> >             return;
> >
> > Since LogwrtResult.Flush doesn't advance anymore, this condition doesn't
> > becomes false due to us waking walwriter, it only becomes false once the
> > timeout expires (which it would have done anyway with no help from us),
> or
> > once wal_writer_flush_after is exceeded.  So now every async commit is
> > waking the walwriter.  Most of those wake up do nothing (there is not a
> > completely new patch to write), some write one completed page but don't
> > flush anything, and very few do a flush, and that one would have been
> done
> > anyway.
> s/completely new patch/completely new page/?

Yes.  My mental tab completion sometimes goes awry.

> In my opinion we actually *do* want to write (but not flush!) out such
> pages, so I'm not sure I agree with that logic.  Have to think about
> this some more...

I think if we want to write each page, we should do it in the foreground.
Conceptually, it could just be another enum value in synchronous_commit,
but synchronous_commit is getting rather unwieldy already.



Reply via email to