Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

Greg Smith Sun, 14 Jul 2013 13:28:57 -0700

On 7/3/13 9:39 AM, Andres Freund wrote:

> I wonder how much of this could be gained by doing a
> sync_file_range(SYNC_FILE_RANGE_WRITE) (or similar) either while doing
> the original checkpoint-pass through the buffers or when fsyncing the
> files.

The fsync calls decompose into the queued set of block writes. If
they all need to go out eventually to finish a checkpoint, the most
efficient way from a throughput perspective is to dump them all at once.

I'm not sure sync_file_range targeting checkpoint writes will turn out
any differently than block sorting. Let's say the database tries to get
involved in forcing a particular write order that way. Right now it's
going to be making that ordering decision without the benefit of also
knowing what blocks are being read. That makes it hard to do better
than the OS, which knows a different--and potentially more useful in a
read-heavy environment--set of information about all the pending I/O.
And it would be very expensive to make all the backends start sharing
information about what they read to ever pull that logic into the
database. It's really easy to wander down the path where you assume you
must know more than the OS does, which leads to things like direct I/O.
I am skeptical of that path in general. I really don't want Postgres
to be competing with the innovation rate in Linux kernel I/O if we can
ride it instead.

One idea I was thinking about that overlaps with a sync_file_range
refactoring is simply tracking how many blocks have been written to each
relation. If there was a rule like "fsync any relation that's gotten
more than 100 8K writes", we'd never build up the sort of backlog that
causes the worst latency issues. You really need to start tracking the
file range there, just to fairly account for multiple writes to the same
block. One of the reasons I don't mind all the work I'm planning to put
into block write statistics is that I think that will make it easier to
build this sort of facility too. The original page write and the fsync
call that eventually flushes it out are very disconnected right now, and
file range data seems the right missing piece to connect them well.
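
To make the rule concrete, here is a rough sketch of that accounting in
Python. It is not Postgres code and nothing like it exists in the tree.
The WriteTracker class, its method names, and the sync callback (a
stand-in for an fsync on the relation's file) are all hypothetical.
Block numbers are kept in a set so that rewriting the same block counts
only once, which is the "fairly account for multiple writes" part.

```python
BLOCK_SIZE = 8192        # 8K pages, as in the rule above
FSYNC_THRESHOLD = 100    # "more than 100 8K writes"


class WriteTracker:
    """Hypothetical per-relation tracker of written-but-unsynced blocks."""

    def __init__(self, threshold=FSYNC_THRESHOLD, sync=None):
        self.threshold = threshold
        # sync is an assumed callback standing in for fsync() on the
        # relation's file; by default it does nothing.
        self.sync = sync if sync is not None else (lambda relation: None)
        self.dirty = {}   # relation name -> set of unsynced block numbers

    def record_write(self, relation, offset):
        """Note a page write at a byte offset; return True if it forced a sync."""
        blocks = self.dirty.setdefault(relation, set())
        blocks.add(offset // BLOCK_SIZE)   # a rewritten block is counted once
        if len(blocks) > self.threshold:
            self.sync(relation)
            blocks.clear()
            return True
        return False

    def pending(self, relation):
        """Number of distinct blocks written since the relation's last sync."""
        return len(self.dirty.get(relation, ()))
```

With the default threshold, a relation is synced on the 101st distinct
block written since its last sync, no matter how many times the first
100 were rewritten in between.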


--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
