<Oops, stalled post, sorry wrong "From", resent..>


Hello Andres,

> +               rc = posix_fadvise(context->fd, context->offset, [...]
>
> I'm a bit wary that this might cause significant regressions on
> platforms not supporting sync_file_range, but support posix_fadvise()
> for workloads that are bigger than shared_buffers. Consider what happens
> if the workload does *not* fit into shared_buffers but *does* fit into
> the OS's buffer cache. Suddenly reads will go to disk again, no?

That is an interesting question!

My current thinking is "maybe yes, maybe no" :-), since it depends on how the OS implements posix_fadvise, so the behavior may differ from one OS to another.

This is one reason why I think that flushing should be kept as a GUC, even if the sort GUC is removed and sorting is always on. The sync_file_range implementation is clearly always very beneficial on Linux, whereas posix_fadvise may or may not induce good behavior, depending on the underlying system.
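
To make the two code paths concrete, here is a minimal sketch of the kind of per-platform flush hint being discussed. It is not the code from the patch: the function name flush_range_hint is hypothetical, and the POSIX_FADV_DONTNEED advice is an assumption on my part, since the quoted hunk elides the remaining arguments. The Linux branch only starts writeback and leaves the pages cached, while the fallback may also drop them from the OS cache, which is the source of the concern above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only, not the patch's code): hint the OS
 * to start writing back the byte range [offset, offset + nbytes) of fd.
 * Returns 0 on success, or an errno-style error number on failure.
 */
static int
flush_range_hint(int fd, off_t offset, off_t nbytes)
{
#if defined(SYNC_FILE_RANGE_WRITE)
    /*
     * Linux: initiate asynchronous writeback of dirty pages in the range.
     * The pages remain in the page cache, so later reads need no disk I/O.
     */
    if (sync_file_range(fd, offset, nbytes, SYNC_FILE_RANGE_WRITE) != 0)
        return errno;
    return 0;
#else
    /*
     * Elsewhere (assumed advice): POSIX_FADV_DONTNEED. What the OS does with
     * this is implementation-defined; it may write back and also evict the
     * pages. posix_fadvise() returns the error number directly.
     */
    return posix_fadvise(fd, offset, nbytes, POSIX_FADV_DONTNEED);
#endif
}
```

A caller would invoke this after a batch of write() calls on the same file, passing the offset and length of what was just written, and treat a nonzero result as a warning rather than a hard error, since it is only a hint.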

This is also why the default value for the flush GUC is currently set to false in the patch. The documentation should advise turning it on for Linux and testing on other systems. Or, if Linux is assumed to be the most common host, the default could be on, with a note that on some systems it may be better to turn it off. (Another reason to keep it off is that I'm not sure what happens with such disk flushing features on virtual servers.)

Overall, I'm not pessimistic, because I've seen I/O storms on a FreeBSD host and they were as bad as on Linux (namely the database and even the box were offline for long minutes...). If you can avoid that, having to read back some data may not be that high a price to pay.

The issue is largely mitigated if the data is not removed from shared_buffers, because the OS buffer is then just a copy of data that is already held. What I would do on such systems is increase shared_buffers and keep flushing on, that is, count less on the system cache and more on postgres' own cache.

Overall, I'm not convinced that relying on the OS cache is good practice, given what the OS does with it, at least on Linux.

Now, if someone could provide a dedicated box with posix_fadvise (say FreeBSD, maybe others...) for testing, that would allow us to provide data instead of speculating... and then maybe to decide to change the default value.

--
Fabien.

