On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
> I assume other kernels have similar I/O smoothing, so that data sent to
> the kernel via write() gets to disk within 30 seconds.
> I assume write() is not our checkpoint performance problem, but the
> transfer to disk via fsync().
Well, it's correct to say that the transfer to disk is the source of the
problem, but that doesn't occur only when we fsync(). There are actually
two disk I/O storms, because of the way the fs cache works. [Ron
referred to this effect uplist]
Linux 2.6+ will attempt to write to disk any dirty blocks in excess of a
certain threshold, dirty_background_ratio, which defaults to 10% of
RAM. So when the checkpoint issues lots of write() calls, we generally
exceed the threshold and then begin I/O storm number 1 to get us back
down to dirty_background_ratio. When we issue the fsync() calls, we
then begin I/O storm number 2, which takes us down close to zero dirty
blocks (on a dedicated server). [Thanks to my colleague Richard Kennedy
for confirming this via investigation; he's been on leave throughout
this discussion, regrettably].
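To make that threshold concrete, here is a rough sketch of where storm 1
kicks in (the 10% figure is the Linux default for
vm.dirty_background_ratio; the 4 GiB RAM size is an assumed example, not
anything from this thread):

```python
# Sketch: at what point does Linux start background writeback (storm 1)?
# Assumes vm.dirty_background_ratio = 10 (the Linux 2.6 default) and
# 4 GiB of RAM, purely for illustration.
def dirty_background_threshold(ram_bytes, dirty_background_ratio=10):
    """Bytes of dirty page cache that trigger background writeback."""
    return ram_bytes * dirty_background_ratio // 100

ram = 4 * 1024**3                       # 4 GiB, assumed
threshold = dirty_background_threshold(ram)
print(threshold // (1024**2), "MiB")    # → 409 MiB; dirty data beyond
                                        # this starts storm 1
```

So on this assumed box, a checkpoint that write()s more than ~400 MiB of
dirty buffers will set off storm 1 before fsync() is ever called.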
Putting delays in is very simple and does help; however, how much it
helps depends upon:
- the number of dirty blocks in shared_buffers
- the dirty_background_ratio
- the number of dirty blocks in each file when we fsync()
For example, on a system with very large RAM yet a medium write
workload, the dirty_background_ratio may never be exceeded. In that
case, all of the I/O happens during storm 2. If you set
dirty_background_ratio lower, then more of the writes happen during
storm 1.
During storm 2, the fsync() calls write all dirty blocks in a file to
disk. In many cases, a few tables/files receive all of the writes, so
adding a delay between fsyncs doesn't spread out the writes as you
might hope.
Most of the time, storm 1 and storm 2 run together in a continuous
stream, but sometimes you see a double peak. There is an overlap of a
few seconds between 1 and 2 in many cases.
Linux will also write dirty blocks to disk after a period of
inactivity, controlled by dirty_expire_centisecs, which defaults to
3000 (30 seconds). So putting a delay between storm 1 and storm 2
should help matters somewhat, but 30 secs is probably almost exactly
the wrong number (by chance), though I do like the idea.
> Perhaps a simple solution is to do the
> write()'s of all dirty buffers as we do now at checkpoint time, but
> delay 30 seconds and then do fsync() on all the files.
So yes, putting a short delay, say 10 seconds, in at that point should
help matters somewhat, sometimes. (But the exact number depends upon how
the OS is tuned.)
> The goal here is
> that during the 30-second delay, the kernel will be forcing data to the
> disk, so the fsync() we eventually do will only be for the write() of
> buffers during the 30-second delay, and because we wrote all dirty
> buffers 30 seconds ago, there shouldn't be too many of them.
...but not for that reason.
IMHO the best thing to do is to
1. put a short delay between the write() steps in the checkpoint
2. put a longer delay in between the write() phase and the fsync() phase
3. tune the OS writeback mechanism to help smoothing
Either set (1) and (2) as GUCs, or have code that reads the OS settings
and acts accordingly.
Put another way: both Bruce and Itagaki-san have good ideas.
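The three-phase shape of (1) and (2) above could look something like
this (a minimal sketch, not PostgreSQL code; checkpoint_write_delay and
checkpoint_fsync_delay stand in for the proposed GUCs and are invented
names):

```python
import os, time

# Sketch of the proposed checkpoint shape: spread the write()s out,
# then pause so the kernel's writeback can drain dirty pages in the
# background, then fsync whatever remains.
def checkpoint(files_with_dirty_buffers,
               checkpoint_write_delay=0.01,   # step 1: delay between write()s
               checkpoint_fsync_delay=10.0):  # step 2: delay before fsyncs
    # Phase 1: hand dirty buffers to the kernel, with short pauses.
    for fd, buffers in files_with_dirty_buffers.items():
        for buf in buffers:
            os.write(fd, buf)
            time.sleep(checkpoint_write_delay)
    # Phase 2: let the kernel's writeback (dirty_background_ratio,
    # dirty_expire_centisecs) move data to disk on its own.
    time.sleep(checkpoint_fsync_delay)
    # Phase 3: fsync only forces out what the kernel hasn't written yet.
    for fd in files_with_dirty_buffers:
        os.fsync(fd)
```

As the thread notes, the right delay values depend on how the OS
writeback knobs are tuned, which is why reading the OS settings (or
making these GUCs) matters.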
> > So, in the real world, one conclusion seems to be that our existing
> > method of tuning the background writer just isn't good enough for the
> > average user:
> > #bgwriter_delay = 200ms          # 10-10000ms between rounds
> > #bgwriter_lru_percent = 1.0      # 0-100% of LRU buffers scanned/round
> > #bgwriter_lru_maxpages = 5       # 0-1000 buffers max written/round
> > #bgwriter_all_percent = 0.333    # 0-100% of all buffers scanned/round
> > #bgwriter_all_maxpages = 5       # 0-1000 buffers max written/round
> > These settings control what the bgwriter does, but they do not clearly
> > relate to the checkpoint timing, which is the purpose of the bgwriter,
> > and they don't change during the checkpoint interval, which is also less
> > than ideal. If set too aggressively, it writes too much, and if set
> > too low, the checkpoint does too much I/O.
Yes, that's very clear.
> > We clearly need more bgwriter activity as the checkpoint approaches
I'd put it that we should write a block to disk prior to checkpoint if
it appears that it won't be dirtied again if we do so. That doesn't
necessarily translate directly into *more* activity.
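That "won't be dirtied again" test could be read as a recency heuristic
(a sketch under assumed buffer metadata; usage_count is modelled on
PostgreSQL's clock-sweep counter, but the cutoff value is invented):

```python
# Sketch: only pre-write a dirty buffer ahead of the checkpoint if it
# looks "cold", i.e. unlikely to be dirtied again before the checkpoint.
# usage_count mimics PostgreSQL's clock-sweep counter; the cutoff of 0
# is an assumption for illustration.
def should_prewrite(buffer, usage_cutoff=0):
    return buffer["dirty"] and buffer["usage_count"] <= usage_cutoff

buffers = [
    {"id": 1, "dirty": True,  "usage_count": 0},  # cold and dirty: write it
    {"id": 2, "dirty": True,  "usage_count": 4},  # hot: likely re-dirtied
    {"id": 3, "dirty": False, "usage_count": 0},  # clean: nothing to do
]
print([b["id"] for b in buffers if should_prewrite(b)])  # → [1]
```

Under this reading, pre-checkpoint writing can actually be *less*
activity than a blanket ramp-up, because hot buffers are skipped.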
> , and
> > one that is more auto-tuned, like many of our other parameters. I think
> > we created these settings to see how they worked in the field, so
> > it's probably time to reevaluate them based on field reports.
> > I think the bgwriter should keep track of how far it is to the next
> > checkpoint, and use that information to increase write activity.
> > Basically now, during a checkpoint, the bgwriter does a full buffer scan
> > and fsync's all dirty files, so it changes from the configuration
> > parameter-defined behavior right to 100% activity. I think it would be
> > ideal if we could ramp up the writes so that when it is 95% to the next
> > checkpoint, it can be operating at 95% of the activity it would do
> > during a checkpoint.
> > My guess is if we can do that, we will have much smoother performance
> > because we have more WAL writes just after checkpoint for newly-dirtied
> > pages, and the new setup will give us more write activity just before
> > checkpoint.
Well, as long as the kernel ignores Postgres and Postgres ignores the
kernel, things will never be smooth (literally). If we write more, but
are still below the dirty_background_ratio, it won't make the slightest
bit of difference.
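For reference, Bruce's ramp-up idea quoted above is easy to state as a
formula (a sketch only; the parameter names are invented, not existing
GUCs, and the page counts are arbitrary):

```python
# Sketch of ramping bgwriter effort with checkpoint progress: at 95% of
# the way to the next checkpoint, run at ~95% of checkpoint-level write
# activity. Names and numbers are illustrative only.
def bgwriter_max_pages(progress, checkpoint_rate_pages=1000, base_pages=5):
    """progress: fraction (0.0-1.0) of the checkpoint interval elapsed."""
    progress = max(0.0, min(1.0, progress))
    return int(base_pages + progress * (checkpoint_rate_pages - base_pages))

for p in (0.0, 0.5, 0.95, 1.0):
    print(p, bgwriter_max_pages(p))   # ramps from 5 pages up to 1000
```

But per the point above, any such ramp only smooths things if the extra
writes actually cross the kernel's dirty_background_ratio threshold.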
Trying to get trustworthy/explicable performance test reports is already
difficult for this reason.
IMHO, the best approach would be one that takes into account how the OS
behaves, so we can work with it. Regrettably, I can't see any way of
doing this other than OS-specific code of some shape or form.
> > One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> > the kernel cache, so we are sure data will be on disk by checkpoint
> > time. This was avoided in the past because of the expense of
> > second-guessing the kernel disk I/O scheduling algorithms.
Seems like a longer term best approach to me.