Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how it could cause deadlocks, as we're not waiting
> for a lock/resource locked by someone else, but it's certainly an issue
> for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all
kinds of lock-ordering rules.


> > My best idea for how to implement this in a somewhat safe way would be
> > for XLogInsertRecord() to set a flag indicating that we should
> > throttle, and set InterruptPending = true. Then the next
> > CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a
> > critical / interrupts off section) can actually perform the delay.
> > That should fix the hard deadlock danger and remove most of the
> > increase in lock contention.
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.
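To make that concrete, here's a rough sketch of the flag-based approach
(untested; XLogDelayPending, backendWalInserted and the wal_throttle_*
parameters are made-up names for illustration, not anything in the tree):

    /*
     * In XLogInsertRecord(), after the record has been copied in: only
     * note that we owe a delay - never sleep here, since we may be in a
     * critical section and/or holding WAL insert locks.
     */
    backendWalInserted += rechdr->xl_tot_len;
    if (backendWalInserted >= wal_throttle_threshold)
    {
        XLogDelayPending = true;
        InterruptPending = true;    /* make CHECK_FOR_INTERRUPTS() notice */
    }

    /*
     * In ProcessInterrupts(), which only runs when interrupts are
     * allowed, i.e. outside critical sections: do the actual throttling.
     */
    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        backendWalInserted = 0;
        pg_usleep(wal_throttle_delay * 1000L);  /* or flush / wait for syncrep */
    }
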
> > I don't think doing an XLogFlush() of a record that we just wrote is a
> > good idea - that'll sometimes significantly increase the number of
> > fdatasyncs/sec we're doing. To make matters worse, this will often
> > cause partially filled WAL pages to be flushed out - rewriting a WAL
> > page multiple times can significantly increase overhead on the storage
> > level. At the very least this'd have to flush only up to the last
> > fully filled page.
>
> I don't see why that would significantly increase the number of flushes.
> The goal is to do this only every ~1MB of WAL or so, and chances are
> there were many (perhaps hundreds or more) commits in between. At least
> in a workload with a fair number of OLTP transactions (which is kinda
> the point of all this).

Because the accounting is done separately in each process. Even if you
just add a few additional flushes for each connection, in aggregate
that'll be a lot.

> And the "small" OLTP transactions don't really do any extra fsyncs,
> because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the
> > overhead even worse, as the flush will be triggered even for OLTP
> > sessions doing tiny transactions, even though they don't contribute to
> > the problem you're trying to address. How about counting how many
> > bytes of WAL a backend has inserted since the last time that backend
> > did an XLogFlush()?
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not really trigger this. That's kinda the point, to only delay
> "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of the
patch resets the accounting at commit?

> Same for the flushes of partially flushed pages - if there are enough
> small OLTP transactions, we're already having this issue. I don't see
> why this would make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how
v1 could *not* make it worse. Suddenly you get a bunch of additional
XLogFlush() calls to partial boundaries by every autovacuum, by every
process doing a bit more bulky work. Because you're flushing the LSN after
a record you just inserted, you're commonly not going to be "joining" the
work of an already in-progress XLogFlush().

> But if needed, we can simply back off to the last page boundary, so that
> we only flush the complete page. That would work too - the goal is not
> to flush everything, but to reduce how much of the lag is due to the
> current process (i.e. to wait for the sync replica to catch up).

Yes, as I proposed...

> > I also suspect the overhead will be more manageable if you were to
> > force a flush only up to a point further back than the last fully
> > filled page. We don't want to end up flushing WAL for every page, but
> > if you just have a backend-local accounting mechanism, I think that's
> > inevitably what you'd end up with when you have a large number of
> > sessions. But if you'd limit the flushing to be done to
> > synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a
> > prior boundary, the amount of unnecessary WAL flushes would be
> > proportional to synchronous_commit_flush_wal_after.
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
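For illustration, that boundary logic could look roughly like this
(untested sketch; MaybeThrottleFlush() is a made-up helper, and I'm
assuming synchronous_commit_flush_wal_after is in kB):

    static void
    MaybeThrottleFlush(XLogRecPtr insertLSN)
    {
        uint64      boundary;
        XLogRecPtr  target;

        boundary = (uint64) synchronous_commit_flush_wal_after * 1024 / 2;

        /* round down to the prior boundary, never flush the insert point */
        target = insertLSN - (insertLSN % boundary);

        /* if that boundary is already flushed, somebody else did the work */
        if (target <= GetFlushRecPtr(NULL))
            return;

        XLogFlush(target);
    }

Because every backend rounds down to the same grid, most callers either
skip the flush entirely or join a flush that's already in progress,
instead of re-syncing a partially filled page over and over.
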
Greetings,

Andres Freund