Simon Riggs <[EMAIL PROTECTED]> writes:
> I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
> SELECT, when not in archive mode (PITR). The main use case for this is
> large BI environments that create summary tables or prejoined tables,
> though there are many general applications.

Applied after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(, and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries.  It will
be a very real hazard for COPY however.  The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones.  To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit.  We can just omit updating the FSM's
running average, if it even has one.  (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

Patch as applied is attached.

                        regards, tom lane

*** src/backend/access/heap/heapam.c.orig       Wed Jun  8 11:50:21 2005
--- src/backend/access/heap/heapam.c    Mon Jun 20 13:50:16 2005
***************
*** 1034,1042 ****
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid)
  {
        TransactionId xid = GetCurrentTransactionId();
        Buffer          buffer;
--- 1034,1053 ----
   *
   * The new tuple is stamped with current transaction ID and the specified
   * command ID.
+  *
+  * If use_wal is false, the new tuple is not logged in WAL, even for a
+  * non-temp relation.  Safe usage of this behavior requires that we arrange
+  * that all new tuples go into new pages not containing any tuples from other
+  * transactions, that the relation gets fsync'd before commit, and that the
+  * transaction emits at least one WAL record to ensure RecordTransactionCommit
+  * will decide to WAL-log the commit.
+  *
+  * use_fsm is passed directly to RelationGetBufferForTuple, which see for
+  * more info.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!                       bool use_wal, bool use_fsm)
  {
        TransactionId xid = GetCurrentTransactionId();
        Buffer          buffer;
***************
*** 1086,1092 ****
                heap_tuple_toast_attrs(relation, tup, NULL);
  
        /* Find buffer to insert this tuple into */
!       buffer = RelationGetBufferForTuple(relation, tup->t_len, InvalidBuffer);
  
        /* NO EREPORT(ERROR) from here till changes are logged */
        START_CRIT_SECTION();
--- 1097,1104 ----
                heap_tuple_toast_attrs(relation, tup, NULL);
  
        /* Find buffer to insert this tuple into */
!       buffer = RelationGetBufferForTuple(relation, tup->t_len,
!                                                                          
InvalidBuffer, use_fsm);
  
        /* NO EREPORT(ERROR) from here till changes are logged */
        START_CRIT_SECTION();
***************
*** 1096,1102 ****
        pgstat_count_heap_insert(&relation->pgstat_info);
  
        /* XLOG stuff */
!       if (!relation->rd_istemp)
        {
                xl_heap_insert xlrec;
                xl_heap_header xlhdr;
--- 1108,1119 ----
        pgstat_count_heap_insert(&relation->pgstat_info);
  
        /* XLOG stuff */
!       if (relation->rd_istemp)
!       {
!               /* No XLOG record, but still need to flag that XID exists on 
disk */
!               MyXactMadeTempRelUpdate = true;
!       }
!       else if (use_wal)
        {
                xl_heap_insert xlrec;
                xl_heap_header xlhdr;
***************
*** 1151,1161 ****
                PageSetLSN(page, recptr);
                PageSetTLI(page, ThisTimeLineID);
        }
-       else
-       {
-               /* No XLOG record, but still need to flag that XID exists on 
disk */
-               MyXactMadeTempRelUpdate = true;
-       }
  
        END_CRIT_SECTION();
  
--- 1168,1173 ----
***************
*** 1183,1189 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!       return heap_insert(relation, tup, GetCurrentCommandId());
  }
  
  /*
--- 1195,1201 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
!       return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }
  
  /*
***************
*** 1743,1749 ****
                {
                        /* Assume there's no chance to put newtup on same page. 
*/
                        newbuf = RelationGetBufferForTuple(relation, 
newtup->t_len,
!                                                                               
           buffer);
                }
                else
                {
--- 1755,1761 ----
                {
                        /* Assume there's no chance to put newtup on same page. 
*/
                        newbuf = RelationGetBufferForTuple(relation, 
newtup->t_len,
!                                                                               
           buffer, true);
                }
                else
                {
***************
*** 1760,1766 ****
                                 */
                                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                                newbuf = RelationGetBufferForTuple(relation, 
newtup->t_len,
!                                                                               
                   buffer);
                        }
                        else
                        {
--- 1772,1778 ----
                                 */
                                LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
                                newbuf = RelationGetBufferForTuple(relation, 
newtup->t_len,
!                                                                               
                   buffer, true);
                        }
                        else
                        {
*** src/backend/access/heap/hio.c.orig  Sat May  7 17:32:23 2005
--- src/backend/access/heap/hio.c       Mon Jun 20 13:50:16 2005
***************
*** 79,90 ****
   *    happen if space is freed in that page after heap_update finds there's 
not
   *    enough there).  In that case, the page will be pinned and locked only 
once.
   *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                                                 Buffer otherBuffer)
  {
        Buffer          buffer = InvalidBuffer;
        Page            pageHeader;
--- 79,104 ----
   *    happen if space is freed in that page after heap_update finds there's 
not
   *    enough there).  In that case, the page will be pinned and locked only 
once.
   *
+  *    If use_fsm is true (the normal case), we use FSM to help us find free
+  *    space.  If use_fsm is false, we always append a new empty page to the
+  *    end of the relation if the tuple won't fit on the current target page.
+  *    This can save some cycles when we know the relation is new and doesn't
+  *    contain useful amounts of free space.
+  *
+  *    The use_fsm = false case is also useful for non-WAL-logged additions to 
a
+  *    relation, if the caller holds exclusive lock and is careful to 
invalidate
+  *    relation->rd_targblock before the first insertion --- that ensures that
+  *    all insertions will occur into newly added pages and not be intermixed
+  *    with tuples from other transactions.  That way, a crash can't risk 
losing
+  *    any committed data of other transactions.  (See heap_insert's comments
+  *    for additional constraints needed for safe usage of this behavior.)
+  *
   *    ereport(ERROR) is allowed here, so this routine *must* be called
   *    before any (unlogged) changes are made in buffer pool.
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
!                                                 Buffer otherBuffer, bool 
use_fsm)
  {
        Buffer          buffer = InvalidBuffer;
        Page            pageHeader;
***************
*** 121,131 ****
         * on each page that proves not to be suitable.)  If the FSM has no
         * record of a page with enough free space, we give up and extend the
         * relation.
         */
  
        targetBlock = relation->rd_targblock;
  
!       if (targetBlock == InvalidBlockNumber)
        {
                /*
                 * We have no cached target page, so ask the FSM for an initial
--- 135,148 ----
         * on each page that proves not to be suitable.)  If the FSM has no
         * record of a page with enough free space, we give up and extend the
         * relation.
+        *
+        * When use_fsm is false, we either put the tuple onto the existing
+        * target page or extend the relation.
         */
  
        targetBlock = relation->rd_targblock;
  
!       if (targetBlock == InvalidBlockNumber && use_fsm)
        {
                /*
                 * We have no cached target page, so ask the FSM for an initial
***************
*** 208,213 ****
--- 225,234 ----
                        LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
                        ReleaseBuffer(buffer);
                }
+ 
+               /* Without FSM, always fall out of the loop and extend */
+               if (!use_fsm)
+                       break;
  
                /*
                 * Update FSM as to condition of this page, and ask for another
*** src/backend/executor/execMain.c.orig        Sun May 22 18:30:19 2005
--- src/backend/executor/execMain.c     Mon Jun 20 14:22:16 2005
***************
*** 33,38 ****
--- 33,39 ----
  #include "postgres.h"
  
  #include "access/heapam.h"
+ #include "access/xlog.h"
  #include "catalog/heap.h"
  #include "catalog/namespace.h"
  #include "commands/tablecmds.h"
***************
*** 44,49 ****
--- 45,51 ----
  #include "optimizer/clauses.h"
  #include "optimizer/var.h"
  #include "parser/parsetree.h"
+ #include "storage/smgr.h"
  #include "utils/acl.h"
  #include "utils/guc.h"
  #include "utils/lsyscache.h"
***************
*** 784,789 ****
--- 786,805 ----
                 * And open the constructed table for writing.
                 */
                intoRelationDesc = heap_open(intoRelationId, 
AccessExclusiveLock);
+ 
+               /* use_wal off requires rd_targblock be initially invalid */
+               Assert(intoRelationDesc->rd_targblock == InvalidBlockNumber);
+ 
+               /*
+                * We can skip WAL-logging the insertions, unless PITR is in 
use.
+                *
+                * Note that for a non-temp INTO table, this is safe only 
because
+                * we know that the catalog changes above will have been 
WAL-logged,
+                * and so RecordTransactionCommit will think it needs to 
WAL-log the
+                * eventual transaction commit.  Else the commit might be lost, 
even
+                * though all the data is safely fsync'd ...
+                */
+               estate->es_into_relation_use_wal = XLogArchivingActive();
        }
  
        estate->es_into_relation_descriptor = intoRelationDesc;
***************
*** 979,985 ****
--- 995,1016 ----
         * close the "into" relation if necessary, again keeping lock
         */
        if (estate->es_into_relation_descriptor != NULL)
+       {
+               /*
+                * If we skipped using WAL, and it's not a temp relation,
+                * we must force the relation down to disk before it's
+                * safe to commit the transaction.  This requires forcing
+                * out any dirty buffers and then doing a forced fsync.
+                */
+               if (!estate->es_into_relation_use_wal &&
+                       !estate->es_into_relation_descriptor->rd_istemp)
+               {
+                       
FlushRelationBuffers(estate->es_into_relation_descriptor);
+                       
smgrimmedsync(estate->es_into_relation_descriptor->rd_smgr);
+               }
+ 
                heap_close(estate->es_into_relation_descriptor, NoLock);
+    }
  
        /*
         * close any relations selected FOR UPDATE/FOR SHARE, again keeping 
locks
***************
*** 1307,1313 ****
  
                tuple = ExecCopySlotTuple(slot);
                heap_insert(estate->es_into_relation_descriptor, tuple,
!                                       estate->es_snapshot->curcid);
                /* we know there are no indexes to update */
                heap_freetuple(tuple);
                IncrAppended();
--- 1338,1346 ----
  
                tuple = ExecCopySlotTuple(slot);
                heap_insert(estate->es_into_relation_descriptor, tuple,
!                                       estate->es_snapshot->curcid,
!                                       estate->es_into_relation_use_wal,
!                                       false);         /* never any point in 
using FSM */
                /* we know there are no indexes to update */
                heap_freetuple(tuple);
                IncrAppended();
***************
*** 1386,1392 ****
         * insert the tuple
         */
        newId = heap_insert(resultRelationDesc, tuple,
!                                               estate->es_snapshot->curcid);
  
        IncrAppended();
        (estate->es_processed)++;
--- 1419,1426 ----
         * insert the tuple
         */
        newId = heap_insert(resultRelationDesc, tuple,
!                                               estate->es_snapshot->curcid,
!                                               true, true);
  
        IncrAppended();
        (estate->es_processed)++;
***************
*** 2089,2094 ****
--- 2123,2129 ----
        epqstate->es_result_relation_info = estate->es_result_relation_info;
        epqstate->es_junkFilter = estate->es_junkFilter;
        epqstate->es_into_relation_descriptor = 
estate->es_into_relation_descriptor;
+       epqstate->es_into_relation_use_wal = estate->es_into_relation_use_wal;
        epqstate->es_param_list_info = estate->es_param_list_info;
        if (estate->es_topPlan->nParamExec > 0)
                epqstate->es_param_exec_vals = (ParamExecData *)
*** src/backend/executor/execUtils.c.orig       Thu Apr 28 17:47:12 2005
--- src/backend/executor/execUtils.c    Mon Jun 20 13:08:30 2005
***************
*** 186,192 ****
--- 186,194 ----
        estate->es_result_relation_info = NULL;
  
        estate->es_junkFilter = NULL;
+ 
        estate->es_into_relation_descriptor = NULL;
+       estate->es_into_relation_use_wal = false;
  
        estate->es_param_list_info = NULL;
        estate->es_param_exec_vals = NULL;
*** src/backend/storage/smgr/md.c.orig  Sun May 29 00:23:05 2005
--- src/backend/storage/smgr/md.c       Mon Jun 20 14:26:50 2005
***************
*** 660,665 ****
--- 660,668 ----
  
  /*
   *    mdimmedsync() -- Immediately sync a relation to stable storage.
+  *
+  * Note that only writes already issued are synced; this routine knows
+  * nothing of dirty buffers that may exist inside the buffer manager.
   */
  bool
  mdimmedsync(SMgrRelation reln)
*** src/backend/storage/smgr/smgr.c.orig        Fri Jun 17 18:32:46 2005
--- src/backend/storage/smgr/smgr.c     Mon Jun 20 14:26:50 2005
***************
*** 650,656 ****
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *            Synchronously force all of the specified relation down to disk.
   *
   *            This is useful for building completely new relations (eg, new
   *            indexes).  Instead of incrementally WAL-logging the index build
--- 650,657 ----
  /*
   *    smgrimmedsync() -- Force the specified relation to stable storage.
   *
!  *            Synchronously force all previous writes to the specified 
relation
!  *            down to disk.
   *
   *            This is useful for building completely new relations (eg, new
   *            indexes).  Instead of incrementally WAL-logging the index build
***************
*** 664,669 ****
--- 665,674 ----
   *
   *            The preceding writes should specify isTemp = true to avoid
   *            duplicative fsyncs.
+  *
+  *            Note that you need to do FlushRelationBuffers() first if there 
is
+  *            any possibility that there are dirty buffers for the relation;
+  *            otherwise the sync is not very meaningful.
   */
  void
  smgrimmedsync(SMgrRelation reln)
*** src/include/access/heapam.h.orig    Mon Jun  6 13:01:24 2005
--- src/include/access/heapam.h Mon Jun 20 13:46:30 2005
***************
*** 156,162 ****
                                        ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);
  
! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, 
ItemPointer ctid,
                        CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple 
tup,
--- 156,163 ----
                                        ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);
  
! extern Oid    heap_insert(Relation relation, HeapTuple tup, CommandId cid,
!                                               bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, 
ItemPointer ctid,
                        CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple 
tup,
*** src/include/access/hio.h.orig       Fri Dec 31 17:46:38 2004
--- src/include/access/hio.h    Mon Jun 20 13:08:23 2005
***************
*** 19,24 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                                         HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                                                 Buffer otherBuffer);
  
  #endif   /* HIO_H */
--- 19,24 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
                                         HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
!                                                                               
Buffer otherBuffer, bool use_fsm);
  
  #endif   /* HIO_H */
*** src/include/nodes/execnodes.h.orig  Fri Jun 17 14:54:18 2005
--- src/include/nodes/execnodes.h       Mon Jun 20 13:08:12 2005
***************
*** 304,310 ****
--- 304,312 ----
        ResultRelInfo *es_result_relation_info;         /* currently active 
array
                                                                                
                 * elt */
        JunkFilter *es_junkFilter;      /* currently active junk filter */
+ 
        Relation        es_into_relation_descriptor;    /* for SELECT INTO */
+       bool            es_into_relation_use_wal;
  
        /* Parameter info: */
        ParamListInfo es_param_list_info;       /* values of external params */
---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org

Reply via email to