Hm. New theory: The current flush interface does the flushing inside FlushBuffer()->smgrwrite()->mdwrite()->FileWrite()->FlushContextSchedule(). The problem with that is that at that point we (need to) hold a content lock on the buffer!
You are worried that FlushBuffer holds a lock on the buffer at the moment the "sync_file_range" call is issued.
Although I agree that it is not that good, I would be surprised if that were the explanation for a performance regression, because sync_file_range with the chosen parameters is an asynchronous call: it "advises" the OS to send the file data to disk, but it does not wait for the write to complete.
Moreover, for this issue to have a significant impact, another backend would have to happen to need this very buffer at that moment. ISTM that the performance regression you are arguing about is on random-IO-bound performance, that is a few 100 tps in the best case, on very large bases and hence with a lot of buffers. The probability of such a collision is thus very small, so it would not explain a significant regression.
Especially on a system that's bottlenecked on IO that means we'll frequently hold content locks for a noticeable amount of time, while flushing blocks, without any need to.
I'm not sure it is really noticeable, because sync_file_range does not wait for completion.
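To make the "does not wait for completion" point concrete, here is a minimal sketch of such a call, assuming Linux and assuming the flag in question is SYNC_FILE_RANGE_WRITE alone. The helper names hint_writeback and demo_hint_writeback are hypothetical and are not taken from the patch. Note that the sync_file_range man page warns that even this form may block if more is written than the request queue can hold, so "asynchronous" is a hint rather than a guarantee.

```c
#define _GNU_SOURCE             /* sync_file_range() is a Linux-specific extension */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to start writeback of the dirty
 * pages in [offset, offset + nbytes). With SYNC_FILE_RANGE_WRITE alone
 * the call only initiates the write-out; waiting would additionally
 * require SYNC_FILE_RANGE_WAIT_BEFORE and/or SYNC_FILE_RANGE_WAIT_AFTER.
 */
static int
hint_writeback(int fd, off_t offset, off_t nbytes)
{
    return sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE);
}

/*
 * Hypothetical demo: dirty a few bytes in a temporary file, then hint
 * writeback for them. Returns 0 on success, -1 on failure.
 */
static int
demo_hint_writeback(void)
{
    char path[] = "/tmp/sfr_demoXXXXXX";
    int  fd = mkstemp(path);
    int  rc;

    if (fd < 0)
        return -1;

    rc = (write(fd, "hello", 5) == 5) ? hint_writeback(fd, 0, 5) : -1;

    close(fd);
    unlink(path);
    return rc;
}
```

Whether the time spent inside this call is noticeable while a content lock is held is exactly the open question; the sketch only shows which part of the work the flag requests.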
Even if that's not the reason for the slowdowns I observed, I think this fact gives further credence to the current "pending flushes" tracking residing on the wrong level.
ISTM that I put the tracking at the level where the information is available without having to recompute it several times, as the flush needs to know the fd and offset. Doing it differently would mean more code and translating buffer to file/offset several times, I think.
Also, maybe you could answer a question I had about the performance regression you observed. I could not find the post where you gave the detailed information about it, and I would like to try reproducing it: what are the exact settings and conditions (shared_buffers, pgbench scaling, host memory, ...), what is the observed regression (tps? other?), and what is the responsiveness of the database under the regression (e.g. % of seconds with 0 tps, or something like that)?
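As an illustration of tracking at the fd/offset level, here is a minimal sketch, not the patch's actual code. The names PendingRange, PendingFlushes, pending_add and MAX_PENDING are all hypothetical. It records each written range where the fd and offset are already known, and merges a range into the previous one when the two are contiguous in the same file, so that no buffer-to-file translation is needed later.

```c
#include <stdbool.h>
#include <sys/types.h>

#define MAX_PENDING 32          /* hypothetical capacity of the pending list */

/* One written-but-not-yet-flushed range of a file (hypothetical type). */
typedef struct PendingRange
{
    int   fd;
    off_t offset;
    off_t nbytes;
} PendingRange;

/* Fixed-size list of pending ranges (hypothetical type). */
typedef struct PendingFlushes
{
    int          count;
    PendingRange ranges[MAX_PENDING];
} PendingFlushes;

/*
 * Record a freshly written range. If it directly follows the last
 * recorded range in the same file, the two are merged. Returns false,
 * without recording anything, when the list is full and no merge is
 * possible; the caller is then expected to issue the pending flushes,
 * reset count to 0, and retry.
 */
static bool
pending_add(PendingFlushes *pf, int fd, off_t offset, off_t nbytes)
{
    if (pf->count > 0)
    {
        PendingRange *last = &pf->ranges[pf->count - 1];

        if (last->fd == fd && last->offset + last->nbytes == offset)
        {
            last->nbytes += nbytes;     /* extend the previous range */
            return true;
        }
    }

    if (pf->count == MAX_PENDING)
        return false;                   /* full: caller must drain first */

    pf->ranges[pf->count].fd = fd;
    pf->ranges[pf->count].offset = offset;
    pf->ranges[pf->count].nbytes = nbytes;
    pf->count++;
    return true;
}
```

Recording a range this way is cheap; the point at which the recorded ranges are actually handed to the OS is a separate choice that this sketch deliberately leaves to the caller.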
-- Fabien.