Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: the number of backend and checkpoint writes to a file is not connected at all to when that file gets synced.

We know that a 1GB relation segment can take a really long time to write out. That could include up to 128K (131,072) changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.

Rather than second-guess the I/O scheduling, I'd like to take this on directly by recognizing that the size of the problem is proportional to the number of writes to a segment. If you turned off fsync absorption altogether, you'd be at an extreme that allows only 1 write before fsync. That's low latency for each write, but terrible throughput. The maximum throughput case of 128K writes has the terrible latency we get reports about. But what if that trade-off were just a straight, linear slider going from 1 to 128K? Just move it to the latency vs. throughput position you want, and see how that works out.

The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its default value is -1 (or 0), which corresponds to ignoring this new behavior.

-Whenever the background writer absorbs a fsync call for a relation that's already in the queue, increment the absorption count.

-If max_segment_absorb > 0, have the background writer scan for relations where absorption_count > max_segment_absorb. When it finds one, call fsync on that segment.
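
To make the steps above concrete, here is a toy simulation of them. It's a sketch in Python, not PostgreSQL code: the FsyncQueue class and its method names are made up for illustration, and the real pending-fsync table lives in C inside the server. Only absorption_count and max_segment_absorb come from the proposal itself.

```python
class FsyncQueue:
    """Toy model of the pending-fsync queue with a per-segment absorption count."""

    def __init__(self, max_segment_absorb=-1):
        # -1 (or 0) disables the new behavior, matching the proposed default.
        self.max_segment_absorb = max_segment_absorb
        self.pending = {}   # segment name -> absorption_count
        self.fsynced = []   # segments that were synced early, in order

    def absorb(self, segment):
        """Record one fsync request for a segment."""
        if segment in self.pending:
            # Already queued: the request is absorbed, so bump the count.
            self.pending[segment] += 1
        else:
            # First request just creates the queue entry.
            self.pending[segment] = 0

    def scan(self):
        """Background-writer pass: sync segments that absorbed too many requests."""
        if self.max_segment_absorb <= 0:
            return
        for segment, count in list(self.pending.items()):
            if count > self.max_segment_absorb:
                self.fsynced.append(segment)  # stand-in for a real fsync()
                del self.pending[segment]


queue = FsyncQueue(max_segment_absorb=3)
for _ in range(5):
    queue.absorb("busy_segment")   # 1 new entry, then 4 absorbed requests
queue.absorb("quiet_segment")
queue.scan()                       # only busy_segment crosses the threshold
```

In this model, a lower max_segment_absorb means more frequent, smaller syncs, and a higher one means fewer, larger syncs.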

Note that it's possible for this simple scheme to be fooled when writes are actually touching a small number of pages. A process that constantly overwrites the same page is the worst case here. Overwrite it 128K times, and this method would assume you've dirtied every page, while only 1 will actually go to disk when you call fsync. It's possible to track this better. The count mechanism could be replaced with a bitmap of the 128K blocks, so that absorbs set a bit instead of incrementing a count. My gut feel is that this is more complexity than is really necessary here. If in fact the fsync is slimmer than expected, paying for it too much isn't the worst problem to have here.
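
The difference between the two tracking methods in the worst case can be shown with a small comparison. This is again a Python sketch with hypothetical names (CountTracker, BitmapTracker). It also assumes the absorb path knows which block number was written, which the count-based scheme doesn't need.

```python
class CountTracker:
    """Count-based estimate: every absorbed write is assumed to dirty a new page."""

    def __init__(self):
        self.count = 0

    def absorb(self, block):
        self.count += 1

    def estimate(self):
        return self.count


class BitmapTracker:
    """Bitmap-based estimate: one bit per block, so repeat writes are not double counted."""

    def __init__(self, blocks_per_segment):
        self.bits = bytearray((blocks_per_segment + 7) // 8)

    def absorb(self, block):
        self.bits[block // 8] |= 1 << (block % 8)

    def estimate(self):
        # Number of distinct blocks written since the last sync.
        return sum(bin(byte).count("1") for byte in self.bits)


# Worst case from the text: the same page overwritten again and again.
counter = CountTracker()
bitmap = BitmapTracker(blocks_per_segment=1024)  # small value, for illustration only
for _ in range(1000):
    counter.absorb(7)
    bitmap.absorb(7)
```

Here the counter's estimate is 1000 dirty pages while the bitmap's is 1. The price of the bitmap is one bit per block for every queued segment, plus passing block numbers through the absorb path.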

I'd like to build this myself, but if someone else wants to take a shot at it I won't mind. Just be aware the review is the big part here. I should be honest about one thing: I have zero incentive to actually work on this. The moderate amount of sponsorship money I've raised for 9.4 so far isn't getting anywhere near this work. The checkpoint patch review I have been doing recently is coming out of my weekend volunteer time.

And I can't get too excited about making this my volunteer effort when I consider what the resulting credit will look like. Coding is by far the smallest part of work like this, well behind coming up with the design in the first place. And both of those are way, way behind how long review benchmarking takes on something like this. The way credit is distributed for this sort of feature puts coding first, design not credited at all, and maybe you'll see some small review credit for benchmarks. That's completely backwards from the actual work ratio. If all I'm getting out of something is credit, I'd at least like it to be an appropriate amount of it.
Greg Smith   2ndQuadrant US   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support

Sent via pgsql-hackers mailing list