Hi,

On 2023-01-26 12:08:16 +0100, Tomas Vondra wrote:
> It's not clear to me how it could cause deadlocks, as we're not waiting
> for a lock/resource locked by someone else, but it's certainly an issue
> for uninterruptible hangs.

Maybe not. But I wouldn't want to bet on it. It's a violation of all
kinds of lock-ordering rules.


> > My best idea for how to implement this in a somewhat safe way would be
> > for XLogInsertRecord() to set a flag indicating that we should
> > throttle, and set InterruptPending = true. Then the next
> > CHECK_FOR_INTERRUPTS that's allowed to proceed (i.e. we'll not be in a
> > critical / interrupts off section) can actually perform the delay.
> > That should fix the hard deadlock danger and remove most of the
> > increase in lock contention.
>
> The solution I've imagined is something like autovacuum throttling - do
> some accounting of how much "WAL bandwidth" each process consumed, and
> then do the delay/sleep in a suitable place.

Where would such a point be, except for ProcessInterrupts()? Iteratively
adding a separate set of "wal_delay()" points all over the executor,
commands/, ... seems like a bad plan.
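To make that concrete, here's a rough sketch of the flag-based approach
(untested; XLogDelayPending, backendWalInserted and the wal_throttle_*
parameters are made-up names for illustration, not anything in the tree):

    /*
     * In XLogInsertRecord(), after the record has been copied in: only
     * note that we owe a delay - never sleep here, since we may be in a
     * critical section and/or holding WAL insert locks.
     */
    backendWalInserted += rechdr->xl_tot_len;
    if (backendWalInserted >= wal_throttle_threshold)
    {
        XLogDelayPending = true;
        InterruptPending = true;    /* make CHECK_FOR_INTERRUPTS() notice */
    }

    /*
     * In ProcessInterrupts(), which only runs when interrupts are
     * allowed, i.e. outside critical sections: do the actual throttling.
     */
    if (XLogDelayPending)
    {
        XLogDelayPending = false;
        backendWalInserted = 0;
        pg_usleep(wal_throttle_delay * 1000L);  /* or flush / wait for syncrep */
    }
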
> > I don't think doing an XLogFlush() of a record that we just wrote is a
> > good idea - that'll sometimes significantly increase the number of
> > fdatasyncs/sec we're doing. To make matters worse, this will often
> > cause partially filled WAL pages to be flushed out - rewriting a WAL
> > page multiple times can significantly increase overhead on the storage
> > level. At the very least this'd have to flush only up to the last
> > fully filled page.
>
> I don't see why that would significantly increase the number of flushes.
> The goal is to do this only every ~1MB of WAL or so, and chances are
> there were many (perhaps hundreds or more) commits in between. At least
> in a workload with a fair number of OLTP transactions (which is kinda
> the point of all this).

Because the accounting is done separately in each process. Even if you
just add a few additional flushes for each connection, in aggregate
that'll be a lot.

> And the "small" OLTP transactions don't really do any extra fsyncs,
> because the accounting resets at commit (or places that flush WAL).
> [...]
> > Just counting the number of bytes inserted by a backend will make the
> > overhead even worse, as the flush will be triggered even for OLTP
> > sessions doing tiny transactions, even though they don't contribute to
> > the problem you're trying to address. How about counting how many
> > bytes of WAL a backend has inserted since the last time that backend
> > did an XLogFlush()?
>
> No, we should reset the counter at commit, so small OLTP transactions
> should not really trigger this. That's kinda the point, to only delay
> "large" transactions producing a lot of WAL.

I might have missed something, but I don't think the first version of the
patch resets the accounting at commit?

> Same for the flushes of partially flushed pages - if there are enough
> small OLTP transactions, we're already having this issue. I don't see
> why this would make it measurably worse.

Yes, we already have that problem, and it hurts. *Badly*. I don't see how
v1 could *not* make it worse. Suddenly you get a bunch of additional
XLogFlush() calls to partial boundaries by every autovacuum, by every
process doing a bit more bulky work. Because you're flushing the LSN after
a record you just inserted, you're commonly not going to be "joining" the
work of an already in-progress XLogFlush().

> But if needed, we can simply back off to the last page boundary, so that
> we only flush the complete page. That would work too - the goal is not
> to flush everything, but to reduce how much of the lag is due to the
> current process (i.e. to wait for the sync replica to catch up).

Yes, as I proposed...

> > I also suspect the overhead will be more manageable if you were to
> > force a flush only up to a point further back than the last fully
> > filled page. We don't want to end up flushing WAL for every page, but
> > if you just have a backend-local accounting mechanism, I think that's
> > inevitably what you'd end up with when you have a large number of
> > sessions. But if you'd limit the flushing to be done to
> > synchronous_commit_flush_wal_after / 2 boundaries, and only ever to a
> > prior boundary, the amount of unnecessary WAL flushes would be
> > proportional to synchronous_commit_flush_wal_after.
>
> True, that's kinda what I suggested above as a solution for partially
> filled WAL pages.

I think flushing only up to a point further in the past than the last page
boundary is somewhat important to prevent unnecessary flushes.
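For illustration, that boundary logic could look roughly like this
(untested sketch; MaybeThrottleFlush() is a made-up helper, and I'm
assuming synchronous_commit_flush_wal_after is in kB):

    static void
    MaybeThrottleFlush(XLogRecPtr insertLSN)
    {
        uint64      boundary;
        XLogRecPtr  target;

        boundary = (uint64) synchronous_commit_flush_wal_after * 1024 / 2;

        /* round down to the prior boundary, never flush the insert point */
        target = insertLSN - (insertLSN % boundary);

        /* if that boundary is already flushed, somebody else did the work */
        if (target <= GetFlushRecPtr(NULL))
            return;

        XLogFlush(target);
    }

Because every backend rounds down to the same grid, most callers either
skip the flush entirely or join a flush that's already in progress,
instead of re-syncing a partially filled page over and over.
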
Greetings,

Andres Freund