I have a new idea.  Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution.  I am very familiar
with the BSD kernel, and it solves a similar problem when trying to
smooth writes:
        
        
http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0
        
        UPDATE(4)                 BSD Programmer's Manual                 UPDATE(4)
        
        NAME
             update - trickle sync filesystem caches to disk
        
        DESCRIPTION
             At system boot time, the kernel starts filesys_syncer, process
             3.  This process helps protect the integrity of disk volumes
             by ensuring that volatile cached filesystem data are written
             to disk within the vfs.generic.syncdelay interval which defaults
             to thirty seconds (see sysctl(8)).  When a vnode is first
             written it is placed vfs.generic.syncdelay seconds down on
             the trickle sync queue.  If it still exists and has dirty data
             when it reaches the top of the queue, filesys_syncer writes
             it to disk.  This approach evens out the load on the underlying
             I/O system and avoids writing short-lived files.  The papers
             on trickle-sync tend to favor aging based on buffers rather
             than files.  However, BSD/OS synchronizes on file age rather
             than buffer age because the data structures are much smaller
             as there are typically far fewer files than buffers.  Although
             this can make the I/O bursty when a big file is written to
             disk, it is still much better than the wholesale writes that
             were being done by the historic update process which wrote
             all dirty data buffers every 30 seconds.  It also adapts much
             better to the soft update code which wants to control aging
             to improve performance (inodes age in one third of
             vfs.generic.syncdelay seconds, directories in one half of
             vfs.generic.syncdelay seconds).  This ordering ensures that
             most dependencies are gone (e.g., inodes are written when
             directory entries want to go to disk) reducing the amount
             of work that the soft update code needs to do.

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.  

I assume write() is not our checkpoint performance problem; rather, it
is the transfer to disk via fsync().  Perhaps a simple solution is to
issue the write()s of all dirty buffers as we do now at checkpoint
time, but then delay 30 seconds before doing fsync() on all the files.
The goal is that during the 30-second delay, the kernel will already be
forcing the data to disk, so the eventual fsync()s only have to flush
whatever was written during the delay itself, and because we wrote all
dirty buffers 30 seconds earlier, there shouldn't be much of it.
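
In code form, the idea is roughly this (just a sketch to show the shape
of it; the function, its arguments, and the 30-second constant are
placeholders, not actual backend code):

        #include <unistd.h>             /* write(), fsync(), sleep() */

        #define KERNEL_SYNC_DELAY 30    /* vfs.generic.syncdelay default */

        /* Sketch of the proposed two-phase checkpoint; dirty buffers are
         * represented as parallel arrays of fds, data, and lengths. */
        static void
        checkpoint_sketch(int fd[], const char *data[], size_t len[], int n)
        {
            int     i;

            /* Phase 1: hand every dirty buffer to the kernel, as we do now. */
            for (i = 0; i < n; i++)
                (void) write(fd[i], data[i], len[i]);

            /* Let the kernel's own trickle-sync push the data to disk. */
            sleep(KERNEL_SYNC_DELAY);

            /* Phase 2: by now only buffers written during the delay should
             * still be dirty, so these fsync()s have little left to flush. */
            for (i = 0; i < n; i++)
                (void) fsync(fd[i]);
        }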

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us.  The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> 
> I have thought a while about this and I have some ideas.
> 
> Ideally, we would be able to trickle the sync of individual blocks
> during the checkpoint, but we can't because we rely on the kernel to
> sync all dirty blocks that haven't made it to disk using fsync().  We
> could trickle the fsync() calls, but that just increases the amount of
> data we write that was dirtied after the checkpoint started.  In an
> ideal world, we would be able to fsync() only part of a file at a time,
> and only those blocks that were dirtied pre-checkpoint, but I don't see
> that happening anytime soon (which is one reason why many commercial
> databases bypass the kernel cache).
> 
> So, in the real world, one conclusion seems to be that our existing
> method of tuning the background writer just isn't good enough for the
> average user:
> 
>       #bgwriter_delay = 200ms         # 10-10000ms between rounds
>       #bgwriter_lru_percent = 1.0     # 0-100% of LRU buffers scanned/round
>       #bgwriter_lru_maxpages = 5      # 0-1000 buffers max written/round
>       #bgwriter_all_percent = 0.333   # 0-100% of all buffers scanned/round
>       #bgwriter_all_maxpages = 5      # 0-1000 buffers max written/round
> 
> These settings control what the bgwriter does, but they do not clearly
> relate to the checkpoint timing, which is the whole purpose of the
> bgwriter, and they don't change during the checkpoint interval, which is
> also less than ideal.  If set too aggressively, the bgwriter writes too
> much; if set too low, the checkpoint does too much I/O.
> 
> We clearly need more bgwriter activity as the checkpoint approaches, and
> it should be more auto-tuned, like many of our other parameters.  I think
> we created these settings to see how they would work in the field, so it
> is probably time to reevaluate them based on field reports.
> 
> I think the bgwriter should keep track of how far away the next
> checkpoint is, and use that information to increase write activity.
> Right now, during a checkpoint, the bgwriter does a full buffer scan
> and fsync's all dirty files, so it jumps from its configured behavior
> straight to 100% activity.  I think it would be ideal if we could ramp
> up the writes so that when we are 95% of the way to the next
> checkpoint, the bgwriter is operating at 95% of the activity it would
> perform during a checkpoint, as sketched below.
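> 
> As a crude illustration of that ramp-up (the names here other than
> bgwriter_all_maxpages are made up for the example, not actual bgwriter
> code):
> 
>     /* Fraction of the checkpoint interval already elapsed, 0.0 - 1.0. */
>     double  progress = time_since_checkpoint / checkpoint_interval;
> 
>     /* Scale the per-round write limit linearly from the configured
>      * bgwriter rate toward full checkpoint-level activity, so at 95%
>      * of the interval we do roughly 95% of checkpoint-level writing. */
>     int     max_pages = bgwriter_all_maxpages +
>         (int) (progress * (checkpoint_rate_pages - bgwriter_all_maxpages));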
> 
> My guess is that if we can do that, we will have much smoother
> performance, because WAL writes for newly-dirtied pages are heaviest
> just after a checkpoint, and the new setup will give us more write
> activity just before the checkpoint.
> 
> One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> the kernel cache, so we are sure data will be on disk by checkpoint
> time.  This was avoided in the past because of the expense of
> second-guessing the kernel disk I/O scheduling algorithms.
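> For illustration, opening a relation file that way might look like this
> (path is just a placeholder):
> 
>     #include <fcntl.h>          /* open(), O_WRONLY, O_SYNC */
> 
>     /* O_SYNC forces each write() to reach stable storage before it
>      * returns; O_DIRECT would additionally bypass the kernel cache,
>      * but requires carefully aligned buffers. */
>     int     fd = open(path, O_WRONLY | O_SYNC);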
> 
> ---------------------------------------------------------------------------
> 
> Tom Lane wrote:
> > "Kevin Grittner" <[EMAIL PROTECTED]> writes:
> > > "Jim C. Nasby" <[EMAIL PROTECTED]> wrote: 
> > >> Generally, I try and configure the all* settings so that you'll get 1
> > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> > >> have any actual tests to back that methodology up.
> > 
> > > We got to these numbers somewhat scientifically.  I studied I/O
> > > patterns under production load and figured we should be able to handle
> > > about 800 writes per 200 ms without causing problems.  I have to
> > > admit that I based the percentages and the ratio between "all" and "lru"
> > > on gut feel after musing over the documentation.
> > 
> > I like Kevin's settings better than what Jim suggests.  If the bgwriter
> > only makes one sweep between checkpoints then it's hardly going to make
> > any impact at all on the number of dirty buffers the checkpoint will
> > have to write.  The point of the bgwriter is to reduce the checkpoint
> > I/O spike by doing writes between checkpoints, and to have any
> > meaningful impact on that, you'll need it to make the cycle several times.
> > 
> > Another point here is that you want checkpoints to be pretty far apart
> > to minimize the WAL load from full-page images.  So again, a bgwriter
> > that's only making one loop per checkpoint is not gonna be doing much.
> > 
> > I wonder whether it would be feasible to teach the bgwriter to get more
> > aggressive as the time for the next checkpoint approaches?  Writes
> > issued early in the interval have a much higher probability of being
> > wasted (because the page gets re-dirtied later).  But maybe that just
> > reduces to what Takahiro-san already suggested, namely that
> > checkpoint-time writes should be done with the same kind of scheduling
> > the bgwriter uses outside checkpoints.  We still have the problem that
> > the real I/O storm is triggered by fsync() not write(), and we don't
> > have a way to spread out the consequences of fsync().
> > 
> >                     regards, tom lane
> > 
> 
> -- 
>   Bruce Momjian   [EMAIL PROTECTED]
>   EnterpriseDB    http://www.enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 

-- 
  Bruce Momjian   [EMAIL PROTECTED]
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
