On Thu, Dec 8, 2016 at 6:03 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Michael, your greetings were passed on to me with a request that I
> look at this thread.

Thanks for showing up!

> On Fri, Nov 18, 2016 at 12:33 PM, Michael Paquier
> <michael.paqu...@gmail.com> wrote:
>>>> More seriously, if there could be more details regarding that, I would
>>>> think that we could say something like "logging the init fork is
>>>> mandatory in any case to ensure its on-disk presence when at recovery
>>>> replay, even on non-default tablespaces whose base location are
>>>> deleted and re-created from scratch if the WAL record in charge of
>>>> creating this tablespace gets replayed". The problem shows up because
>>>> of tablespaces being deleted at replay at the end... So perhaps this
>>>> makes sense. What do you think?
>>>
>>> Yes, that's about what I think we need to explain.
>>
>> OK, what do you think about the attached then?
>>
>> I came up with something like that for those code paths:
>> -   /* Write the page.  If archiving/streaming, XLOG it. */
>> +   /*
>> +    * Write the page and log it unconditionally.  This is important
>> +    * particularly for indexes created on tablespaces and databases
>> +    * whose creation happened after the last redo pointer as recovery
>> +    * removes any of their existing content when the corresponding
>> +    * create records are replayed.
>> +    */
>> I have not been able to use the word "create" less than that. The
>> patch is really repetitive, but I think that we had better mention the
>> need of logging the content in each code path and not touch
>> log_newpage() to keep it a maximum flexible.
>
> In blinsert.c, nbtree.c, etc. how about:
>
> Write the page and log it.  It might seem that an immediate sync would
> be sufficient to guarantee that the file exists on disk, but recovery
> itself might remove it while replaying, for example, an
> XLOG_DBASE_CREATE record.  Therefore, we need this even when
> wal_level=minimal.

OK, I rewrote a bit the patch as attached. What do you think?

>>> Actually, I'm
>>> wondering if we ought to just switch this over to using the delayChkpt
>>> mechanism instead of going through this complicated song-and-dance.
>>> Incurring an immediate sync to avoid having to WAL-log this is a bit
>>> tenuous, but having to WAL-log it AND do the immediate sync just seems
>>> silly, and I'm actually a bit worried that even with your fix there
>>> might still be a latent bug somewhere.  We couldn't switch mechanisms
>>> cleanly in the 9.2 branch (cf.
>>> f21bb9cfb5646e1793dcc9c0ea697bab99afa523) so perhaps we should
>>> back-patch it in the form you have it in already, but it's probably
>>> worth switching over at least in master.
>>
>> Thinking through it... Could we not just rip off the sync phase
>> because there is delayChkpt? It seems to me that what counts is that
>> the commit of the transaction that creates the relation does not get
>> past the redo point. It would matter for read uncommitted transactions
>> but that concept does not exist in PG, and never will. On
>> back-branches it is definitely safer to keep the sync, I am just
>> thinking about a HEAD-only optimization here as you do.
>
> Right (I think).  If we set and clear delayChkpt around this work, we
> don't need the immediate sync.

My point is a bit different than what you mean I think: the
transaction creating an unlogged relfilenode would not need to even
set delayChkpt in the empty() routines because other transactions
would not refer to it until this transaction has committed. So I am
arguing about just removing the sync phase.
-- 
Michael
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 0946aa29ec..a31149f044 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -164,13 +164,18 @@ blbuildempty(Relation index)
        metapage = (Page) palloc(BLCKSZ);
        BloomFillMetapage(index, metapage);
 
-       /* Write the page.  If archiving/streaming, XLOG it. */
+       /*
+        * Write the page and log it.  It might seem that an immediate sync
+        * would be sufficient to guarantee that the file exists on disk, but
+        * recovery itself might remove it while replaying, for example, an
+        * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+        * need this even when wal_level=minimal.
+        */
        PageSetChecksumInplace(metapage, BLOOM_METAPAGE_BLKNO);
        smgrwrite(index->rd_smgr, INIT_FORKNUM, BLOOM_METAPAGE_BLKNO,
                          (char *) metapage, true);
-       if (XLogIsNeeded())
-               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                                       BLOOM_METAPAGE_BLKNO, metapage, false);
+       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+                               BLOOM_METAPAGE_BLKNO, metapage, false);
 
        /*
         * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/nbtree/nbtree.c 
b/src/backend/access/nbtree/nbtree.c
index 128744c5b7..a264b92c13 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -242,13 +242,18 @@ btbuildempty(Relation index)
        metapage = (Page) palloc(BLCKSZ);
        _bt_initmetapage(metapage, P_NONE, 0);
 
-       /* Write the page.  If archiving/streaming, XLOG it. */
+       /*
+        * Write the page and log it.  It might seem that an immediate sync
+        * would be sufficient to guarantee that the file exists on disk, but
+        * recovery itself might remove it while replaying, for example, an
+        * XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE record.  Therefore, we
+        * need this even when wal_level=minimal.
+        */
        PageSetChecksumInplace(metapage, BTREE_METAPAGE);
        smgrwrite(index->rd_smgr, INIT_FORKNUM, BTREE_METAPAGE,
                          (char *) metapage, true);
-       if (XLogIsNeeded())
-               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                                       BTREE_METAPAGE, metapage, false);
+       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+                               BTREE_METAPAGE, metapage, false);
 
        /*
         * An immediate sync is required even if we xlog'd the page, because the
diff --git a/src/backend/access/spgist/spginsert.c 
b/src/backend/access/spgist/spginsert.c
index 01c8d213f5..3eaa649eff 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -161,13 +161,18 @@ spgbuildempty(Relation index)
        page = (Page) palloc(BLCKSZ);
        SpGistInitMetapage(page);
 
-       /* Write the page.  If archiving/streaming, XLOG it. */
+       /*
+        * Write the page and log it unconditionally.  This is important
+        * particularly for indexes created on tablespaces and databases
+        * whose creation happened after the last redo pointer as recovery
+        * removes any of their existing content when the corresponding
+        * create records are replayed.
+        */
        PageSetChecksumInplace(page, SPGIST_METAPAGE_BLKNO);
        smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_METAPAGE_BLKNO,
                          (char *) page, true);
-       if (XLogIsNeeded())
-               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                                       SPGIST_METAPAGE_BLKNO, page, false);
+       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+                               SPGIST_METAPAGE_BLKNO, page, false);
 
        /* Likewise for the root page. */
        SpGistInitPage(page, SPGIST_LEAF);
@@ -175,9 +180,8 @@ spgbuildempty(Relation index)
        PageSetChecksumInplace(page, SPGIST_ROOT_BLKNO);
        smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_ROOT_BLKNO,
                          (char *) page, true);
-       if (XLogIsNeeded())
-               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                                       SPGIST_ROOT_BLKNO, page, true);
+       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+                               SPGIST_ROOT_BLKNO, page, true);
 
        /* Likewise for the null-tuples root page. */
        SpGistInitPage(page, SPGIST_LEAF | SPGIST_NULLS);
@@ -185,9 +189,8 @@ spgbuildempty(Relation index)
        PageSetChecksumInplace(page, SPGIST_NULL_BLKNO);
        smgrwrite(index->rd_smgr, INIT_FORKNUM, SPGIST_NULL_BLKNO,
                          (char *) page, true);
-       if (XLogIsNeeded())
-               log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
-                                       SPGIST_NULL_BLKNO, page, true);
+       log_newpage(&index->rd_smgr->smgr_rnode.node, INIT_FORKNUM,
+                               SPGIST_NULL_BLKNO, page, true);
 
        /*
         * An immediate sync is required even if we xlog'd the pages, because 
the
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 7f5bad0b5d..5ffea74855 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1380,18 +1380,19 @@ heap_create_with_catalog(const char *relname,
 
 /*
  * Set up an init fork for an unlogged table so that it can be correctly
- * reinitialized on restart.  Since we're going to do an immediate sync, we
- * only need to xlog this if archiving or streaming is enabled.  And the
- * immediate sync is required, because otherwise there's no guarantee that
- * this will hit the disk before the next checkpoint moves the redo pointer.
+ * reinitialized on restart.  An immediate sync is required even if the
+ * page has been logged, because the write did not go through
+ * shared_buffers and therefore a concurrent checkpoint may have moved
+ * the redo pointer past our xlog record.  Recovery may as well remove it
+ * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
+ * record. Therefore, logging is necessary even if wal_level=minimal.
  */
 void
 heap_create_init_fork(Relation rel)
 {
        RelationOpenSmgr(rel);
        smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
-       if (XLogIsNeeded())
-               log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
+       log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
        smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
 }
 
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to