On 7/3/13 9:39 AM, Andres Freund wrote:
I wonder how much of this could be gained by doing a sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing the original checkpoint-pass through the buffers or when fsyncing the files.
The fsync calls decompose into the queued set of block writes. If they all need to go out eventually to finish a checkpoint, the most efficient way from a throughput perspective is to dump them all at once.
I'm not sure sync_file_range targeting checkpoint writes will turn out any differently than block sorting. Let's say the database tries to get involved in forcing a particular write order that way. Right now it's going to be making that ordering decision without the benefit of also knowing what blocks are being read. That makes it hard to do better than the OS, which knows a different--and potentially more useful in a read-heavy environment--set of information about all the pending I/O. And it would be very expensive to make all the backends start sharing information about what they read just to pull that logic into the database. It's really easy to wander down the path where you assume you must know more than the OS does, which leads to things like direct I/O. I am skeptical of that path in general. I really don't want Postgres to be competing with the innovation rate in Linux kernel I/O if we can ride it instead.
One idea I was thinking about that overlaps with a sync_file_range refactoring is simply tracking how many blocks have been written to each relation. If there was a rule like "fsync any relation that's gotten more than 100 8K writes", we'd never build up the sort of backlog that causes the worst latency issues. You really need to start tracking the file range there, just to fairly account for multiple writes to the same block. One of the reasons I don't mind all the work I'm planning to put into block write statistics is that I think that will make it easier to build this sort of facility too. The original page write and the fsync call that eventually flushes it out are very disconnected right now, and file range data seems the right missing piece to connect them well.
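A minimal sketch of that per-relation tracking idea, written in Python for brevity rather than as PostgreSQL code. Everything named here (RelationWriteTracker, record_write, dirty_range, the sync_relation callback) is hypothetical and only illustrates the rule; it is not an existing PostgreSQL facility. It assumes dirty blocks are kept as a set per relation, so that repeated writes to the same block are counted once, and that the caller supplies whatever actually syncs the file.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Set, Tuple

BLOCK_SIZE = 8192        # 8K pages, as in the rule quoted above
FSYNC_THRESHOLD = 100    # "more than 100 8K writes"


class RelationWriteTracker:
    """Hypothetical helper: track distinct dirty blocks per relation."""

    def __init__(self, sync_relation: Callable[[str], None],
                 threshold: int = FSYNC_THRESHOLD) -> None:
        # sync_relation is an assumed callback; in a real system it might
        # fsync the relation's file or start writeback on the dirty range.
        self._sync_relation = sync_relation
        self._threshold = threshold
        self._dirty: Dict[str, Set[int]] = defaultdict(set)

    def record_write(self, relation: str, block_no: int) -> bool:
        """Note a block write; return True if it triggered a sync."""
        blocks = self._dirty[relation]
        blocks.add(block_no)  # a rewrite of the same block is not recounted
        if len(blocks) > self._threshold:
            self._sync_relation(relation)
            blocks.clear()
            return True
        return False

    def pending(self, relation: str) -> int:
        """Number of distinct blocks written since the last sync."""
        return len(self._dirty.get(relation, ()))

    def dirty_range(self, relation: str) -> Optional[Tuple[int, int]]:
        """Return (byte offset, byte length) spanning all dirty blocks."""
        blocks = self._dirty.get(relation)
        if not blocks:
            return None
        first, last = min(blocks), max(blocks)
        return first * BLOCK_SIZE, (last - first + 1) * BLOCK_SIZE
```

The range returned by dirty_range is the kind of file range data that could later be handed to the sync step, which is the link between the page write and the eventual flush that is missing today. A single min/max span is the simplest possible choice and would over-cover sparse writes; a real design would likely need something finer.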
-- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com