[HACKERS] Design proposal: fsync absorb linear slider

Greg Smith Mon, 22 Jul 2013 20:49:41 -0700

Recently I've been dismissing a lot of suggested changes to checkpoint fsync timing without suggesting an alternative. I have a simple one in mind that captures the biggest problem I see: that the number of backend and checkpoint writes to a file are not connected at all.

We know that a 1GB relation segment can take a really long time to write out. That could include up to 128 changed 8K pages, and we allow all of them to get dirty before any are forced to disk with fsync.

Rather than second guess the I/O scheduling, I'd like to take this on directly by recognizing that the size of the problem is proportional to the number of writes to a segment. If you turned off fsync absorption altogether, you'd be at an extreme that allows only 1 write before fsync. That's low latency for each write, but terrible throughput. The maximum throughput case of 128 writes has the terrible latency we get reports about. But what if that trade-off was just a straight, linear slider going from 1 to 128? Just move it to the latency vs. throughput position you want, and see how that works out.


The implementation I had in mind was this one:

-Add an absorption_count to the fsync queue.

-Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its default value is -1 (or 0), which corresponds to ignoring this new behavior.

-Whenever the background writer absorbs a fsync call for a relation that's already in the queue, increment the absorption count.

-If max_segment_absorb > 0, have the background writer scan for relations where absorption_count > max_segment_absorb. When it finds one, call fsync on that segment.
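
To make the steps above concrete, here is a minimal toy model in Python. It is only a sketch of the idea, not PostgreSQL code: the FsyncQueue class, its method names, and the fsync_log list are hypothetical, and dropping a segment from the queue once it has been synced is my assumption, since the steps above don't say what happens to the entry afterward.

```python
from dataclasses import dataclass, field


@dataclass
class FsyncQueue:
    """Toy model of the fsync request queue (hypothetical, not PostgreSQL's)."""
    max_segment_absorb: int = -1  # -1 (or 0) means the new behavior is ignored
    absorption_count: dict = field(default_factory=dict)  # segment -> absorbed requests
    fsync_log: list = field(default_factory=list)  # segments synced early

    def absorb(self, segment):
        """Record one fsync request for a segment."""
        if segment in self.absorption_count:
            # Already in the queue: the request is absorbed, so bump the count.
            self.absorption_count[segment] += 1
        else:
            # First request for this segment: enqueue it with nothing absorbed yet.
            self.absorption_count[segment] = 0

    def background_writer_scan(self):
        """Sync every segment whose absorption count exceeds the limit."""
        if self.max_segment_absorb <= 0:
            return
        for segment, count in list(self.absorption_count.items()):
            if count > self.max_segment_absorb:
                self.fsync_log.append(segment)  # stand-in for a real fsync()
                # Assumption: a synced segment leaves the queue.
                del self.absorption_count[segment]
```

With max_segment_absorb set to 3, the first request enqueues the segment, the next four are absorbed, and the scan after the fifth request is the first one that syncs it.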

Note that it's possible for this simple scheme to be fooled when writes are actually touching a small number of pages. A process that constantly overwrites the same page is the worst case here. Overwrite it 128 times, and this method would assume you've dirtied every page, while only 1 will actually go to disk when you call fsync. It's possible to track this better. The count mechanism could be replaced with a bitmap of the 128 blocks, so that absorbs set a bit instead of incrementing a count. My gut feel is that this is more complexity than is really necessary here. If in fact the fsync is slimmer than expected, paying for it too much isn't the worst problem to have here.
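
The difference between the two tracking schemes can be shown with a small Python sketch. The function names and the BLOCKS_PER_SEGMENT constant are hypothetical, chosen only for illustration; the 128 figure is the one used in this proposal, and block numbers are assumed to fall in the range 0 to 127.

```python
BLOCKS_PER_SEGMENT = 128  # figure used in the proposal, assumed here


def count_estimate(writes):
    """Counter scheme: every absorbed write looks like a newly dirtied block."""
    return len(writes)


def bitmap_estimate(writes):
    """Bitmap scheme: each absorb sets one bit, so repeats are not double counted."""
    bitmap = 0
    for block in writes:  # assumes 0 <= block < BLOCKS_PER_SEGMENT
        bitmap |= 1 << block
    return bin(bitmap).count("1")


same_page = [7] * BLOCKS_PER_SEGMENT           # one page overwritten 128 times
every_page = list(range(BLOCKS_PER_SEGMENT))   # each block written once
```

For same_page the counter reports 128 dirtied blocks while the bitmap reports 1; for every_page both report 128. The counter only ever overestimates, which is the "paying for it too much" case described above.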

I'd like to build this myself, but if someone else wants to take a shot at it I won't mind. Just be aware the review is the big part here. I should be honest about one thing: I have zero incentive to actually work on this. The moderate amount of sponsorship money I've raised for 9.4 so far isn't getting anywhere near this work. The checkpoint patch review I have been doing recently is coming out of my weekend volunteer time.

And I can't get too excited about making this as my volunteer effort when I consider what the resulting credit will look like. Coding is by far the smallest part of work like this, first behind coming up with the design in the first place. And both of those are way, way behind how long review benchmarking takes on something like this. The way credit is distributed for this sort of feature puts coding first, design not credited at all, and maybe you'll see some small review credit for benchmarks. That's completely backwards from the actual work ratio. If all I'm getting out of something is credit, I'd at least like it to be an appropriate amount of it.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
