Recently I've been dismissing a lot of suggested changes to checkpoint
fsync timing without suggesting an alternative. I have a simple one in
mind that captures the biggest problem I see: that the number of
backend and checkpoint writes to a file are not connected at all.
We know that a 1GB relation segment can take a really long time to write
out. That could include up to 128 changed 8K pages, and we allow all of
them to get dirty before any are forced to disk with fsync.
Rather than second guess the I/O scheduling, I'd like to take this on
directly by recognizing that the size of the problem is proportional to
the number of writes to a segment. If you turned off fsync absorption
altogether, you'd be at an extreme that allows only 1 write before
fsync. That's low latency for each write, but terrible throughput. The
maximum throughput case of 128 writes has the terrible latency we get
reports about. But what if that trade-off was just a straight, linear
slider going from 1 to 128? Just move it to the latency vs. throughput
position you want, and see how that works out.
The implementation I had in mind was this one:
-Add an absorption_count to the fsync queue.
-Add a new latency vs. throughput GUC I'll call max_segment_absorb. Its
default value is -1 (or 0), which corresponds to ignoring this new
behavior.
-Whenever the background writer absorbs a fsync call for a relation
that's already in the queue, increment the absorption count.
-When max_segment_absorb > 0, have the background writer scan for relations
where absorption_count > max_segment_absorb. When it finds one, call
fsync on that segment.
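To make the four steps above concrete, here is a rough standalone sketch in C. It is not PostgreSQL source: the PendingFsync struct, the fixed-size pending array, and both function names are made up for illustration, and the real fsync call is replaced by a counter. It also assumes a queue entry is dropped once its segment has been synced, which the list above does not specify.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one entry in the fsync request queue. */
typedef struct PendingFsync
{
    int  segment_id;        /* which relation segment this entry covers */
    int  absorption_count;  /* repeat requests absorbed since it was queued */
    bool in_use;
} PendingFsync;

#define PENDING_SLOTS 16    /* arbitrary size for the sketch */

PendingFsync pending[PENDING_SLOTS];
int max_segment_absorb = -1;    /* the proposed GUC; -1 or 0 disables it */
int fsync_calls = 0;            /* counts simulated fsync calls */

/*
 * Absorb one fsync request.  If the segment is already queued, only bump
 * its absorption count; otherwise take a free slot with a count of zero.
 * Returns false when the queue is full.
 */
bool
absorb_fsync_request(int segment_id)
{
    PendingFsync *free_slot = NULL;

    for (int i = 0; i < PENDING_SLOTS; i++)
    {
        if (pending[i].in_use && pending[i].segment_id == segment_id)
        {
            pending[i].absorption_count++;
            return true;
        }
        if (!pending[i].in_use && free_slot == NULL)
            free_slot = &pending[i];
    }
    if (free_slot == NULL)
        return false;

    free_slot->segment_id = segment_id;
    free_slot->absorption_count = 0;
    free_slot->in_use = true;
    return true;
}

/*
 * Background writer scan: sync every segment whose absorption count has
 * gone past max_segment_absorb.  Returns how many segments were synced.
 */
int
scan_for_early_fsync(void)
{
    int synced = 0;

    if (max_segment_absorb <= 0)
        return 0;               /* new behavior is switched off */

    for (int i = 0; i < PENDING_SLOTS; i++)
    {
        if (pending[i].in_use &&
            pending[i].absorption_count > max_segment_absorb)
        {
            fsync_calls++;              /* real code would fsync the segment */
            pending[i].in_use = false;  /* assumption: entry is cleared */
            synced++;
        }
    }
    return synced;
}
```

With max_segment_absorb set to 3, the first request for a segment queues it, the next three are absorbed without triggering anything, and the one after that pushes the count past the limit so the next scan syncs the segment.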
Note that it's possible for this simple scheme to be fooled when writes
are actually touching a small number of pages. A process that
constantly overwrites the same page is the worst case here. Overwrite
it 128 times, and this method would assume you've dirtied every page,
while only 1 will actually go to disk when you call fsync. It's
possible to track this better. The count mechanism could be replaced
with a bitmap of the 128 blocks, so that absorbs set a bit instead of
incrementing a count. My gut feel is that this is more complexity than
is really necessary here. If the fsync turns out to write less than
expected, calling it a little too often isn't the worst problem to have here.
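For comparison, the bitmap alternative could look roughly like the following. Again this is only a sketch under assumptions: SegmentDirtyMap and the dirty_map_* functions are hypothetical names, and BLOCKS_TRACKED simply reuses the 128 figure from above and would need to match however many blocks are really being tracked per segment.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constant: the block count used in the discussion above. */
#define BLOCKS_TRACKED 128

/* One bit per tracked block; a set bit means a write to it was absorbed. */
typedef struct SegmentDirtyMap
{
    uint8_t bits[BLOCKS_TRACKED / 8];
} SegmentDirtyMap;

/* Reset the map, for example right after the segment has been synced. */
void
dirty_map_clear(SegmentDirtyMap *map)
{
    memset(map->bits, 0, sizeof(map->bits));
}

/*
 * Record an absorbed write to one block.  Setting the same bit twice
 * changes nothing, so repeated overwrites of a page are counted once.
 * Returns false if the block number is out of range.
 */
bool
dirty_map_absorb(SegmentDirtyMap *map, unsigned block)
{
    if (block >= BLOCKS_TRACKED)
        return false;
    map->bits[block / 8] |= (uint8_t) (1u << (block % 8));
    return true;
}

/* Count distinct dirty blocks; this replaces the plain absorption count. */
int
dirty_map_count(const SegmentDirtyMap *map)
{
    int count = 0;

    for (size_t i = 0; i < sizeof(map->bits); i++)
    {
        /* clear the lowest set bit each pass until the byte is empty */
        for (uint8_t b = map->bits[i]; b != 0; b &= (uint8_t) (b - 1))
            count++;
    }
    return count;
}
```

Overwriting block 5 a hundred times leaves dirty_map_count at 1, where the simple counter would have reached 100. The price is a fixed bitmap per queue entry plus a population count on every scan, which is the extra complexity being weighed here.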
I'd like to build this myself, but if someone else wants to take a shot
at it I won't mind. Just be aware the review is the big part here. I
should be honest about one thing: I have zero incentive to actually work
on this. The moderate amount of sponsorship money I've raised for 9.4
so far isn't getting anywhere near this work. The checkpoint patch
review I have been doing recently is coming out of my weekend volunteer time.
And I can't get too excited about making this as my volunteer effort
when I consider what the resulting credit will look like. Coding is by
far the smallest part of work like this, first behind coming up with the
design in the first place. And both of those are way, way behind how
long review benchmarking takes on something like this. The way credit
is distributed for this sort of feature puts coding first, design not
credited at all, and maybe you'll see some small review credit for
benchmarks. That's completely backwards from the actual work ratio. If
all I'm getting out of something is credit, I'd at least like it to be
an appropriate amount of it.
Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com
Sent via pgsql-hackers mailing list (email@example.com)