On 09/11/2015 03:56 PM, Simon Riggs wrote:

> The idea to do a partial pass through shared buffers and only write a
> fraction of dirty buffers, then fsync them is a good one.
>
> The key point is that we spread out the fsyncs across the whole
> checkpoint period.

I doubt that's really what we want to do, as it defeats one of the purposes of spread checkpoints. With spread checkpoints, we write the data to the page cache and then let the OS actually write it to disk in the background. The kernel handles this by marking dirty pages as expired after some time (30 seconds by default on Linux) and then flushing them to disk.
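
As an aside, for anyone who wants to poke at this: on Linux the expiry window is the vm.dirty_expire_centisecs tunable (3000 centisecs = 30 s by default), and the flusher threads wake up every vm.dirty_writeback_centisecs. A trivial sketch to read the current values (illustration only, nothing to do with the patch itself):

/* print the Linux writeback tunables mentioned above */
#include <stdio.h>

static long
read_centisecs(const char *path)
{
    long    val = -1;
    FILE   *f = fopen(path, "r");

    if (f != NULL)
    {
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
    }
    return val;
}

int
main(void)
{
    printf("dirty_expire_centisecs = %ld\n",
           read_centisecs("/proc/sys/vm/dirty_expire_centisecs"));
    printf("dirty_writeback_centisecs = %ld\n",
           read_centisecs("/proc/sys/vm/dirty_writeback_centisecs"));
    return 0;
}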

The goal is to have everything already written to disk when we call fsync at the beginning of the next checkpoint, so that the fsyncs are cheap and don't cause I/O stalls.

What you propose (spreading the fsyncs) significantly changes that, because it shrinks the window the OS has for writing the data to disk in the background from (nearly) the whole checkpoint period to just 1/N of it. For example, with checkpoint_timeout = 5 minutes and N = 10 mini-checkpoints, data written in one pass gets at most ~30 seconds before its fsync. That's a significant change, and I'd bet it's for the worse.


> I think we should be writing out all buffers for a particular file
> in one pass, then issue one fsync per file. >1 fsyncs per file seems
> a bad idea.
>
> So we'd need logic like this
> 1. Run through shared buffers and analyze the files contained in there
> 2. Assign files to one of N batches so we can make N roughly equal sized
> mini-checkpoints
> 3. Make N passes through shared buffers, writing out files assigned to
> each batch as we go
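
For concreteness, the quoted batching logic might look something like this (a toy sketch; all the names and the trivial hash are made up, not actual backend code):

/* illustrative sketch of the batching logic quoted above */
#include <stdio.h>

#define NBUFFERS 8
#define NBATCHES 2

typedef struct
{
    int     file_id;    /* which relation file the buffer belongs to */
    int     block;      /* block number within the file */
} DirtyBuffer;

/* Step 2: map each file to one of N batches (a trivial hash here;
 * real logic would try to balance the batch sizes). */
static int
file_batch(int file_id)
{
    return file_id % NBATCHES;
}

int
main(void)
{
    DirtyBuffer buf[NBUFFERS] = {
        {1, 10}, {2, 3}, {1, 11}, {3, 7},
        {2, 4}, {4, 1}, {3, 8}, {4, 2}
    };

    /* Step 3: make N passes, writing out only buffers whose file falls
     * into the current batch, then fsync each file in the batch once. */
    for (int batch = 0; batch < NBATCHES; batch++)
    {
        printf("mini-checkpoint %d:\n", batch);

        for (int i = 0; i < NBUFFERS; i++)
        {
            if (file_batch(buf[i].file_id) == batch)
                printf("  write file %d block %d\n",
                       buf[i].file_id, buf[i].block);
        }

        /* one fsync per file in this batch */
        for (int f = 1; f <= 4; f++)
        {
            if (file_batch(f) == batch)
                printf("  fsync file %d\n", f);
        }
    }
    return 0;
}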

What I think might work better is keeping the write/fsync phases we have now, but instead of postponing the fsyncs until the next checkpoint, spreading them through the rest of the current one, after the writes. So with checkpoint_completion_target = 0.5 we'd do the writes in the first half and the fsyncs in the other half. Of course, we should sort the data like you propose, and issue the fsyncs in the same order (so that by the time each file is fsynced, the OS has had time to write its data to the devices).
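
In pseudo-C, the checkpointer would do something like this (a toy simulation only; the file names are made up and the sleeps stand in for the actual checkpoint throttling logic):

/* sketch of the write-then-spread-fsync idea described above */
#include <stdio.h>
#include <unistd.h>

#define NFILES 4

static const char *files[NFILES] = {
    "base/16384/16385", "base/16384/16389",
    "base/16384/16393", "base/16384/16397"
};

int
main(void)
{
    /* Phase 1 (first half of the interval): write dirty buffers,
     * sorted by file, so the OS sees mostly sequential I/O. */
    for (int i = 0; i < NFILES; i++)
    {
        printf("write dirty buffers of %s\n", files[i]);
        sleep(1);       /* stands in for checkpoint write throttling */
    }

    /* Phase 2 (second half): fsync the files in the same order, so the
     * file written first, which the kernel has had the longest time to
     * flush in the background, is also the first one fsynced. */
    for (int i = 0; i < NFILES; i++)
    {
        printf("fsync %s\n", files[i]);
        sleep(1);       /* spread the fsyncs, too */
    }
    return 0;
}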

I wonder how much the original paper (written in 1996) is effectively obsoleted by spread checkpoints, but the benchmark results posted by Horikawa-san suggest there's a possible gain. But perhaps partitioning the checkpoints is not the best approach?

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

