Re: [HACKERS] Improvement of checkpoint IO scheduler for stable transaction responses

KONDO Mitsumasa Wed, 26 Jun 2013 01:33:58 -0700

Thank you for your comments!

>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>>> Hmm, so the write patch doesn't do much, but the fsync patch makes the response
>>> times somewhat smoother. I'd suggest that we drop the write patch for now, and
>>> focus on the fsyncs.

The write patch is effective for TPS! I think the delay in checkpoint writes is
caused by long fsyncs and heavy load in the fsync phase, because they slow down
disk writes in the write phase. Therefore, the combination of the write patch
and the fsync patch works better than the write patch alone. I think the amount
of WAL written at the beginning of a checkpoint can indicate the effect of the
write patch.

>>> What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold
>>> settings did you use with the fsync patch? It's disabled by default.

I used these parameters.
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms
In effect, I used a long sleep after slow fsyncs.
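
To make the two settings concrete, here is a minimal Python sketch of how they
might interact. It is an illustration under stated assumptions, not the patch
code: the helper name fsync_delay_ms is hypothetical, and the rule assumed is
that the checkpointer sleeps for (last fsync time x ratio) only when the last
fsync took longer than the threshold. The sample times are taken from the log
messages quoted later in this mail.

```python
def fsync_delay_ms(last_fsync_ms, delay_ratio=1.0, delay_threshold_ms=1000.0):
    """Hypothetical helper: sleep time after one fsync, under the assumed rule.

    Assumption: no sleep unless the previous fsync was slower than the
    threshold; otherwise sleep proportionally to how long that fsync took.
    """
    if last_fsync_ms <= delay_threshold_ms:
        return 0.0
    return last_fsync_ms * delay_ratio


# fsync times (msec) from the log in this mail; defaults mirror the settings
# checkpointer_fsync_delay_ratio = 1, checkpointer_fsync_delay_threshold = 1000ms
for t in (2.546, 898.836, 1232.535, 13848.237):
    print(f"fsync {t:10.3f} msec -> sleep {fsync_delay_ms(t):10.3f} msec")
```

With these settings, short fsyncs cost no extra time, and only fsyncs slower
than one second are followed by a sleep of the same length.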

The other main parameters are as follows.
  checkpoint_completion_target = 0.7
  checkpoint_smooth_target = 0.3
  checkpoint_smooth_margin = 0.5
  checkpointer_write_delay = 200ms

>>> Attached is a quick patch to implement a fixed, 100ms delay between fsyncs,
>>> and the assumption that fsync phase is 10% of the total checkpoint duration.
>>> I suspect 100ms is too small to have much effect, but that happens to be
>>> what we have currently in CheckpointWriteDelay(). Could you test this patch
>>> along with yours? If you can test with different delays (e.g. 100ms, 500ms
>>> and 1000ms) and different ratios between the write and fsync phase (e.g. 0.5,
>>> 0.7, 0.9), to get an idea of how sensitive the test case is to those
>>> settings.

It seems an interesting algorithm! I will test it with the same settings and
study the essence of your patch.



(2013/06/26 5:28), Heikki Linnakangas wrote:
> On 25.06.2013 23:03, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>> <hlinnakan...@vmware.com> wrote:
>>> I'm not sure it's a good idea to sleep proportionally to the time it took to
>>> complete the previous fsync. If you have a 1GB cache in the RAID controller,
>>> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
>>> will return immediately. So we proceed fsyncing other files, until the cache
>>> is full and the fsync blocks. But once we fill up the cache, it's likely
>>> that we're hurting concurrent queries. ISTM it would be better to stay under
>>> that threshold, keeping the I/O system busy, but never fill up the cache
>>> completely.


>> Isn't the behavior implemented by the patch a reasonable approximation
>> of just that? When the fsyncs start to get slow, that's when we start
>> to sleep. I'll grant that it would be better to sleep when the
>> fsyncs are *about* to get slow, rather than when they actually have
>> become slow, but we have no way to know that.


> Well, that's the point I was trying to make: you should sleep *before* the
> fsyncs get slow.

Actually, fsync time depends on the progress of background disk writes in the
OS. We cannot know the progress of those background writes before the fsyncs.
I think Robert's argument is right. Please see the following log messages.


* fsync of files which had already been written to disk
 DEBUG:  00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
 DEBUG:  00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
 DEBUG:  00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
 DEBUG:  00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
 DEBUG:  00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
 DEBUG:  00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of files which had not yet been written to disk very much
 DEBUG:  00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
 DEBUG:  00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
 DEBUG:  00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
 DEBUG:  00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
 DEBUG:  00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
 DEBUG:  00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec

I think it is wasteful to sleep after every fsync, including the short ones.
fsync performance also varies with the hardware, such as the RAID card and the
kind or number of disks, and with the OS, so it is difficult to choose a fixed
sleep time. My proposed method will be more adaptive in these cases.
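
As a rough illustration of this point, the following Python sketch parses the
"checkpoint sync" log lines above and compares where a fixed sleep and a
proportional sleep would spend their time. It is only a sketch under
assumptions: the function names are hypothetical, the fixed delay of 100 msec
is the value from the quick patch quoted earlier, and the proportional rule is
assumed to be "sleep (fsync time x ratio) when the fsync exceeded the
threshold".

```python
import re

# Log lines copied from this mail, one entry per line.
LOG = """
DEBUG:  00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
DEBUG:  00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
DEBUG:  00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
DEBUG:  00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
DEBUG:  00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
DEBUG:  00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec
DEBUG:  00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
DEBUG:  00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
DEBUG:  00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
DEBUG:  00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
DEBUG:  00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
DEBUG:  00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec
"""

SYNC_RE = re.compile(r"checkpoint sync: number=(\d+) file=(\S+) time=([\d.]+) ?msec")


def parse_sync_times_ms(text):
    """Return the fsync duration in msec for every 'checkpoint sync' line."""
    return [float(ms) for _num, _file, ms in SYNC_RE.findall(text)]


def total_sleep_fixed_ms(times_ms, fixed_ms):
    """Fixed policy: sleep the same time after every fsync."""
    return fixed_ms * len(times_ms)


def total_sleep_adaptive_ms(times_ms, ratio, threshold_ms):
    """Assumed proportional policy: sleep only after fsyncs slower than threshold."""
    return sum(t * ratio for t in times_ms if t > threshold_ms)


times = parse_sync_times_ms(LOG)
slow = [t for t in times if t > 1000.0]
print(f"{len(slow)} of {len(times)} fsyncs exceeded 1000 msec")
print(f"fixed 100 msec sleep, total: {total_sleep_fixed_ms(times, 100.0):.3f} msec")
print(f"proportional sleep, total:   {total_sleep_adaptive_ms(times, 1.0, 1000.0):.3f} msec")
```

The fixed policy sleeps after all twelve fsyncs, including the _fsm and _vm
files that finish in microseconds, while the proportional policy sleeps only
after the four fsyncs that were actually slow.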

>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think Kondo-san
>> has got it right.


> Quite possible, I really don't know. I'm inclined to first try the simplest
> thing possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep
> between every fsync, unless you're behind the schedule, is even simpler. In
> particular, it's easier to tie that into the checkpoint scheduler - I'm not
> sure how you'd measure progress or determine how long to sleep unless you
> assume that every fsync is the same.
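
For comparison, the fixed-sleep idea described in the quote could be sketched
as below. This is a hypothetical Python illustration, not code from either
patch: the function name and parameters are invented, and progress is measured
by file count, which is exactly the "every fsync is the same" assumption
mentioned above.

```python
def fixed_fsync_delay_ms(files_synced, files_total, elapsed_s,
                         fsync_phase_budget_s, fixed_delay_ms=100.0):
    """Hypothetical sketch: fixed sleep after each fsync unless behind schedule.

    Assumption: every fsync counts equally, so progress is simply the share
    of files already synced, compared with the share of the time budget used.
    """
    if files_total <= 0 or fsync_phase_budget_s <= 0:
        return 0.0
    progress = files_synced / files_total
    time_used = elapsed_s / fsync_phase_budget_s
    if progress < time_used:
        # Behind schedule: skip the sleep and fsync the next file at once.
        return 0.0
    return fixed_delay_ms


# Half of the files done after 10% of the budget: on schedule, so sleep.
print(fixed_fsync_delay_ms(5, 10, elapsed_s=1.0, fsync_phase_budget_s=10.0))
# Only one file done after half of the budget: behind, so do not sleep.
print(fixed_fsync_delay_ms(1, 10, elapsed_s=5.0, fsync_phase_budget_s=10.0))
```

Counting files makes the schedule check trivial, but as the log above shows, a
single fsync can take anything from microseconds to many seconds, so equal
weighting is a simplification.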

I think what matters in the fsync phase is to finish in as short a time as
possible without an I/O freeze, to keep to the checkpoint schedule, and to be
gentle on running transactions. I will try to improve the patch from that point
of view. By the way, running the DBT-2 benchmark takes a long time (it may be
four hours), so I hope you don't mind if my replies are late! :-)


Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
