Hmmm. What I understood is that the workloads that have some performance
regressions (regressions that I have *not* seen in the many tests I ran) are
not due to checkpointer IOs, but rather in settings where most of the writes
are done by backends or bgwriter.
As far as I can see you've not run many tests where the hot/warm data
set is larger than memory (the full machine's memory, not
Indeed, I think I ran some, but not many with such characteristics.
That quite drastically alters the performance characteristics here,
because you suddenly have lots of synchronous read IO thrown into the
If I understand this point correctly...
I would expect the overall performance to be abysmal in such a situation
because you get only intermixed *random* reads and writes: As you point
out, synchronous *random* reads (very slow), but on the write side the
IOs are mostly random as well on the checkpointer side because there is
not much to aggregate to get sequential writes.
Now why would that degrade performance significantly? For me it should
render the sorting/flushing less and less effective, and it would go back
to the previous performance levels...
Or maybe it is only the flushing itself which degrades performance, as you
point out, because then you have some synchronous (synced) writes as well
as reads, as opposed to just the reads before without the patch.
If this is indeed the issue, then the solution to avoid the regression is
*not* to flush, so that the OS IO scheduler is less constrained in its job
and can be slightly more effective (well, we are talking of abysmal random
IO disk performance here, so "effective" would mean somewhere between
slightly more and slightly less very very very bad).
Maybe a trick could be not to aggregate and flush when buffers in the same
file are too far apart anyway, for instance, based on some threshold?
This can be implemented locally when deciding to merge buffer flushes or
not, and whether to flush or not, so it would fit the current code quite
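
To make the threshold idea concrete, here is a minimal sketch in Python.
It is not PostgreSQL code: the function plan_flushes and the parameters
max_gap and min_run are made up for illustration, and the default values
are arbitrary. Dirty block numbers of one file are grouped into runs, a
run is only flushed when it contains enough blocks, and isolated blocks
are left to the OS.

```python
def plan_flushes(blocks, max_gap=8, min_run=2):
    """Split the dirty blocks of one file into ranges worth flushing.

    blocks  -- dirty block numbers of a single file (any order)
    max_gap -- hypothetical threshold: two blocks further apart than this
               are not merged into the same run
    min_run -- hypothetical threshold: runs with fewer dirty blocks than
               this are not flushed explicitly

    Returns (to_flush, left_to_os): a list of (first, last) block ranges
    to flush, and a list of isolated blocks left to the OS IO scheduler.
    """
    runs = []
    for blk in sorted(set(blocks)):
        if runs and blk - runs[-1][-1] <= max_gap:
            runs[-1].append(blk)   # close enough: extend the current run
        else:
            runs.append([blk])     # too far apart: start a new run

    to_flush = [(r[0], r[-1]) for r in runs if len(r) >= min_run]
    left_to_os = [b for r in runs if len(r) < min_run for b in r]
    return to_flush, left_to_os


if __name__ == "__main__":
    print(plan_flushes([10, 11, 12, 500, 9000, 9001]))
```

With mostly random writes nearly every block would end up in left_to_os,
so the flushing would switch itself off in the situation discussed above.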
Now my understanding of the sync_file_range call is that it is advice
to flush the stuff, but it is still asynchronous in nature, so whether it
would impact performance that badly depends on the OS IO scheduler. Also,
I would like to check whether, under the "regressed performance" (in tps
terms) that you observed, pg is more or less responsive. It could be that
the average performance is better but pg is offline longer on fsync. In
which case, I would consider it better to have lower tps in such cases
*if* pg responsiveness is significantly improved.
Would you have these measurements for the regression runs you observed?
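
To illustrate what I mean by comparing average tps with responsiveness,
here is a small Python sketch. It assumes you have one tps sample per
second for each run; the function name summarize and the stall_tps cutoff
are hypothetical, and the sample numbers below are invented, not measured.

```python
def summarize(tps_samples, stall_tps=10.0):
    """Summarize per-second tps samples of one run.

    stall_tps is a hypothetical cutoff: a second with fewer transactions
    than this is counted as a second where pg was unresponsive.
    """
    if not tps_samples:
        raise ValueError("need at least one tps sample")
    n = len(tps_samples)
    stalled = sum(1 for t in tps_samples if t < stall_tps)
    return {
        "avg_tps": sum(tps_samples) / n,
        "min_tps": min(tps_samples),
        "stalled_fraction": stalled / n,
    }


if __name__ == "__main__":
    # Invented numbers: run A is faster on average but stalls 20% of the
    # time, run B is slower on average but never stalls.
    run_a = [100] * 8 + [0] * 2
    run_b = [70] * 10
    print("A:", summarize(run_a))
    print("B:", summarize(run_b))
```

On such figures I would prefer run B, although its average tps is lower.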
Whether it's bgwriter or not I've not fully been able to establish, but
it's a working theory.
Ok, that is something to check, to confirm or refute it.
Given the above discussion, I think my suggestion may be wrong: as the tps
is low because of random read/write accesses, not many buffers are
modified (so the bgwriter/backends won't need to make space), the
checkpointer does not have much to write (good), *but* all of it is random
I do not see the point of rewriting the checkpointer for them, although
obviously I agree that something has to be done also for the other
Rewriting the checkpointer and fixing the flush interface in a more
generic way aren't the same thing at all.
Hmmm, probably I misunderstood something in the discussion. It started
with an implementation strategy, but it drifted into discussing a
performance regression. I agree that these are two different subjects.
Sent via pgsql-hackers mailing list (email@example.com)