Re: [HACKERS] WAL replay logic (was Re: [PERFORM] Mount options for Ext3?)

Bruce Momjian Fri, 14 Feb 2003 06:30:02 -0800

Is there a TODO here, like "Allow recovery from corrupt pg_control via
WAL"?


---------------------------------------------------------------------------

Kevin Brown wrote:
> Tom Lane wrote:
> > Kevin Brown <[EMAIL PROTECTED]> writes:
> > > One question I have is: in the event of a crash, why not simply replay
> > > all the transactions found in the WAL?  Is the startup time of the
> > > database that badly affected if pg_control is ignored?
> > 
> > Interesting thought, indeed.  Since we truncate the WAL after each
> > checkpoint, seems like this approach would no more than double the time
> > for restart.  
> 
> Hmm...truncating the WAL after each checkpoint minimizes the amount of
> disk space eaten by the WAL, but on the other hand keeping older
> segments around buys you some safety in the event that things get
> really hosed.  But your later comments make it sound like the older
> WAL segments are kept around anyway, just rotated.
> 
> > The win is it'd eliminate pg_control as a single point of
> > failure.  It's always bothered me that we have to update pg_control on
> > every checkpoint --- it should be a write-pretty-darn-seldom file,
> > considering how critical it is.
> > 
> > I think we'd have to make some changes in the code for deleting old
> > WAL segments --- right now it's not careful to delete them in order.
> > But surely that can be coped with.
> 
> Even that might not be necessary.  See below.
> 
> > OTOH, this might just move the locus for fatal failures out of
> > pg_control and into the OS' algorithms for writing directory updates.
> > We would have no cross-check that the set of WAL file names visible in
> > pg_xlog is sensible or aligned with the true state of the datafile
> > area.
> 
> Well, what we somehow need to guarantee is that there is always WAL
> data that is older than the newest consistent data in the datafile
> area, right?  Meaning that if the datafile area gets scribbled on in
> an inconsistent manner, you always have WAL data to fill in the gaps.
> 
> Right now we do that by using fsync() and sync().  But I think it
> would be highly desirable to be able to more or less guarantee
> database consistency even if fsync were turned off.  The price for
> that might be too high, though.
> 
> > We'd have to take it on faith that we should replay the visible files
> > in their name order.  This might mean we'd have to abandon the current
> > hack of recycling xlog segments by renaming them --- which would be a
> > nontrivial performance hit.
> 
> It's probably a bad idea for the replay to be based on the filenames.
> Instead, it should probably be based strictly on the contents of the
> xlog segment files.  Seems to me the beginning of each segment file
> should have some kind of header information that makes it clear where
> in the scheme of things it belongs.  Additionally, writing some sort
> of checksum, either at the beginning or the end, might not be a bad
> idea either (doesn't have to be a strict checksum, but it needs to be
> something that's reasonably likely to catch corruption within a
> segment).
> 
> Do that, and you don't have to worry about renaming xlog segments at
> all: you simply move on to the next logical segment in the list (a
> replay just reads the header info for all the segments and orders the
> list as it sees fit, and discards all segments prior to any gap it
> finds.  It may be that you simply have to bail out if you find a gap,
> though).  As long as the xlog segment checksum information is
> consistent with the contents of the segment and as long as its
> transactions pick up where the previous segment's left off (assuming
> it's not the first segment, of course), you can safely replay the
> transactions it contains.
> 
> I presume we're recycling xlog segments in order to avoid file
> creation and unlink overhead?  Otherwise you can simply create new
> segments as needed and unlink old segments as policy dictates.
> 
> > Comments anyone?
> > 
> > > If there exists somewhere a reasonably succinct description of the
> > > reasoning behind the current transaction management scheme (including
> > > an analysis of the pros and cons), I'd love to read it and quit
> > > bugging you.  :-)
> > 
> > Not that I know of.  Would you care to prepare such a writeup?  There
> > is a lot of material in the source-code comments, but no coherent
> > presentation.
> 
> Be happy to.  Just point me to any non-obvious source files.
> 
> Thus far on my plate:
> 
>     1.  PID file locking for postmaster startup (doesn't strictly need
>         to be the PID file but it may as well be, since we're already
>         messing with it anyway).  I'm currently looking at how to do
>         the autoconf tests, since I've never developed using autoconf
>         before.
> 
>     2.  Documenting the transaction management scheme.
> 
> I was initially interested in implementing the explicit JOIN
> reordering but based on your recent comments I think you have a much
> better handle on that than I.  I'll be very interested to see what you
> do, to see if it's anything close to what I figure has to happen...
> 
> 
> -- 
> Kevin Brown                                         [EMAIL PROTECTED]
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
> 
> http://archives.postgresql.org
> 
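
For what it's worth, Kevin's suggestion above (order replay by what is
in each segment, not by its filename, and use a checksum to catch
corruption) can be sketched roughly as follows.  This is only an
illustration in Python, not the actual xlog code: the 12-byte header
layout (a sequence number plus a CRC-32 of the payload) and the names
write_segment, read_segment and replay_order are all made up for the
example.  It follows the "discard all segments prior to any gap" rule
from the quoted text; bailing out on a gap would be the stricter
alternative.

```python
import struct
import zlib
from typing import Iterable, List, NamedTuple, Optional

# Hypothetical segment header: 8-byte sequence number, 4-byte CRC-32 of payload.
HEADER = struct.Struct("<QI")


class Segment(NamedTuple):
    seq: int
    payload: bytes


def write_segment(seq: int, payload: bytes) -> bytes:
    """Build the on-disk image of a segment: header followed by payload."""
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def read_segment(raw: bytes) -> Optional[Segment]:
    """Parse a segment image; return None if it is truncated or corrupt."""
    if len(raw) < HEADER.size:
        return None
    seq, crc = HEADER.unpack_from(raw)
    payload = raw[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return None
    return Segment(seq, payload)


def replay_order(raw_segments: Iterable[bytes]) -> List[Segment]:
    """Order segments by header sequence number, ignoring file names.

    Corrupt segments are treated as missing.  Whenever a gap in the
    sequence is found, everything before the gap is discarded, so the
    result is the newest contiguous run of valid segments.
    """
    valid = {}
    for raw in raw_segments:
        seg = read_segment(raw)
        if seg is not None:
            valid[seg.seq] = seg  # a later duplicate replaces an earlier one

    run: List[Segment] = []
    for seq in sorted(valid):
        if run and seq != run[-1].seq + 1:
            run = []  # gap found: drop the older segments
        run.append(valid[seq])
    return run


if __name__ == "__main__":
    # Files presented in arbitrary order, as after recycling by rename.
    files = [write_segment(n, b"records-%d" % n) for n in (5, 3, 1, 4)]
    print([seg.seq for seg in replay_order(files)])
```

Because the order comes from the header, renaming a recycled segment
file cannot confuse replay; the cost is that every segment has to be
opened and read once at startup before replay can begin.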

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  [EMAIL PROTECTED]               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

