Thank you for your comments!

>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas wrote:
>>> Hmm, so the write patch doesn't do much, but the fsync patch makes the response
>>> times somewhat smoother. I'd suggest that we drop the write patch for now, and
>>> focus on the fsyncs.
The write patch is effective for TPS! I think that the checkpoint write is
delayed by long fsyncs and the heavy load in the fsync phase, because the disk
gets slow in that phase. Therefore, the combination of the write patch and the
fsync patch works better than the write patch alone. I think that the amount of
WAL written at the beginning of a checkpoint can indicate the effect of the
write patch.

>>> What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold
>>> settings did you use with the fsync patch? It's disabled by default.
I used these parameters.
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms
In practice, this means a long sleep after slow fsyncs.
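
To illustrate how these two settings could interact, here is a minimal Python sketch of a proportional sleep. It is not the patch code. It assumes that the checkpointer sleeps for the previous fsync time multiplied by checkpointer_fsync_delay_ratio, and only when that fsync took at least checkpointer_fsync_delay_threshold. The function name fsync_delay_ms and the exact comparison are hypothetical.

```python
def fsync_delay_ms(last_fsync_ms, delay_ratio=1.0, delay_threshold_ms=1000.0):
    """Return how long to sleep (in msec) after an fsync of last_fsync_ms.

    Assumed semantics, not taken from the patch source:
    - delay_ratio <= 0 disables the sleep entirely
    - fsyncs faster than delay_threshold_ms get no sleep
    - slower fsyncs get a sleep proportional to their own duration
    """
    if delay_ratio <= 0:
        return 0.0
    if last_fsync_ms < delay_threshold_ms:
        return 0.0
    return last_fsync_ms * delay_ratio


# Sync times in msec, taken from the checkpoint sync log in this message.
for sync_ms in (2.546, 898.836, 1232.535, 13848.237):
    print(sync_ms, "->", fsync_delay_ms(sync_ms))
```

With ratio = 1 and threshold = 1000ms, short fsyncs cost no extra time, and only the fsyncs that were already slow are followed by a sleep of about the same length.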

The other main parameters are as follows.
  checkpoint_completion_target = 0.7
  checkpoint_smooth_target = 0.3
  checkpoint_smooth_margin = 0.5
  checkpointer_write_delay = 200ms

>>> Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the
>>> assumption that fsync phase is 10% of the total checkpoint duration. I suspect that
>>> is too small to have much effect, but that happens to be what we have currently in
>>> CheckpointWriteDelay(). Could you test this patch along with yours? If you can test
>>> with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between
>>> the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive the
>>> test case is to those settings.
It seems like an interesting algorithm! I will test it with the same settings and study the essence of your patch.

(2013/06/26 5:28), Heikki Linnakangas wrote:
> On 25.06.2013 23:03, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas wrote:
>>> I'm not sure it's a good idea to sleep proportionally to the time it took to
>>> complete the previous fsync. If you have a 1GB cache in the RAID controller,
>>> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
>>> will return immediately. So we proceed fsyncing other files, until the cache
>>> is full and the fsync blocks. But once we fill up the cache, it's likely
>>> that we're hurting concurrent queries. ISTM it would be better to stay under
>>> that threshold, keeping the I/O system busy, but never fill up the cache

>> Isn't the behavior implemented by the patch a reasonable approximation
>> of just that?  When the fsyncs start to get slow, that's when we start
>> to sleep.  I'll grant that it would be better to sleep when the
>> fsyncs are *about* to get slow, rather than when they actually have
>> become slow, but we have no way to know that.

> Well, that's the point I was trying to make: you should sleep *before* the
> fsyncs get slow.
Actually, the fsync time depends on how far the OS's background disk writes have progressed, and we cannot know that progress before issuing an fsync. I think Robert's argument is right. Please see the following log messages.

* fsync of files that had already been written to disk
 DEBUG:  00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
 DEBUG:  00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
 DEBUG:  00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
 DEBUG:  00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
 DEBUG:  00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
 DEBUG:  00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of files that had mostly not been written to disk yet
 DEBUG:  00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
 DEBUG:  00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
 DEBUG:  00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
 DEBUG:  00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
 DEBUG:  00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
 DEBUG:  00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec
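
For anyone who wants to look at such logs in bulk, the sketch below pulls the sync times out of "checkpoint sync" lines and separates slow fsyncs from fast ones. It is only an illustration: the helper names are hypothetical, the 1000 msec cut-off simply mirrors the checkpointer_fsync_delay_threshold value I used, and the sample lines are copied from the logs above.

```python
import re

# Matches the fields of one "checkpoint sync" entry. The trailing "msec" is
# not required, and several entries on one physical line are all found.
SYNC_RE = re.compile(r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+)")


def parse_sync_times(log_text):
    """Return a list of (number, file, time_msec) tuples found in log_text."""
    return [(int(n), f, float(t)) for n, f, t in SYNC_RE.findall(log_text)]


def split_by_threshold(entries, threshold_ms=1000.0):
    """Split entries into (slow, fast) by comparing time_msec to threshold_ms."""
    slow = [e for e in entries if e[2] >= threshold_ms]
    fast = [e for e in entries if e[2] < threshold_ms]
    return slow, fast


sample = """\
DEBUG:  00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013
DEBUG: 00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
DEBUG: 00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec DEBUG: 00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
DEBUG:  00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002
"""

entries = parse_sync_times(sample)
slow, fast = split_by_threshold(entries)
print("slow fsyncs:", [number for number, _, _ in slow])  # slow fsyncs: [27, 56]
print("fast fsyncs:", [number for number, _, _ in fast])  # fast fsyncs: [26, 57, 59]
```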

I think it is wasteful to sleep after every fsync, including the short ones. fsync performance also varies with the hardware, such as the RAID card and the type and number of disks, and with the OS, so it is difficult to choose a fixed sleep time. My proposed method will be more adaptive in these cases.

>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think Kondo-san
>> has got it right.

> Quite possible, I really don't know. I'm inclined to first try the simplest
> thing possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
> every fsync, unless you're behind the schedule, is even simpler. In particular,
> it's easier to tie that into the checkpoint scheduler - I'm not sure how you'd
> measure progress or determine how long to sleep unless you assume that every
> fsync is the same.
I think what matters in the fsync phase is that it finishes in as short a time as possible without freezing I/O, keeps to the checkpoint schedule, and does not disturb running transactions. I will try to improve the patch from that point of view. By the way, running the DBT-2 benchmark takes a long time (it may be four hours), so I hope you don't mind if my reply is late! :-)
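
For comparison, the fixed-sleep idea quoted above could be sketched roughly as follows. This is not the attached patch. It assumes that progress is measured as the fraction of files already fsynced (so every fsync is treated as costing the same) and that the sleep is skipped when that fraction is behind the fraction of the sync-phase time budget already used. The function name and the sync_budget_ms parameter are hypothetical.

```python
def fixed_delay_ms(files_synced, files_total, elapsed_ms, sync_budget_ms,
                   delay_ms=100.0):
    """Return a fixed sleep (msec) after an fsync, or 0 if behind schedule.

    Assumptions (not taken from any patch):
    - progress = files_synced / files_total, i.e. all fsyncs count equally
    - sync_budget_ms is the time allotted to the whole fsync phase
    """
    if files_total <= 0 or sync_budget_ms <= 0:
        return 0.0
    progress = files_synced / files_total
    time_used = elapsed_ms / sync_budget_ms
    # Sleep only while we are on or ahead of schedule.
    return delay_ms if progress >= time_used else 0.0


# Hypothetical numbers: 10 files to sync and a 30 second budget.
print(fixed_delay_ms(5, 10, 10_000, 30_000))  # ahead of schedule -> 100.0
print(fixed_delay_ms(2, 10, 15_000, 30_000))  # behind schedule   -> 0.0
```

The sketch shows why this variant is simple to schedule, and also its weak point: a single fsync that takes many seconds uses up the budget no matter what delay_ms is.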

Best Regards,
Mitsumasa KONDO
NTT Open Source Software Center

Sent via pgsql-hackers mailing list