I noticed another missing fsync() with unlogged tables, similar to the one at [1].

RelationCopyStorage does this:

        /*
         * When we WAL-logged rel pages, we must nonetheless fsync them.  The
         * reason is that since we're copying outside shared buffers, a 
CHECKPOINT
         * occurring during the copy has no way to flush the previously written
         * data to disk (indeed it won't know the new rel even exists).  A crash
         * later on would replay WAL from the checkpoint, therefore it wouldn't
         * replay our earlier WAL entries. If we do not fsync those pages here,
         * they might still not be on disk when the crash occurs.
         */
        if (use_wal ||  copying_initfork)
                smgrimmedsync(dst, forkNum);

That 'copying_initfork' condition is wrong. The first hint of that is that 'use_wal' is always true for an init fork. I believe this was meant to check for 'relpersistence == RELPERSISTENCE_UNLOGGED'. Otherwise, this bad thing can happen:

1. Create an unlogged table
2. ALTER TABLE unlogged_tbl SET TABLESPACE ... -- This calls RelationCopyStorage
3. a checkpoint happens while the command is running
4. After the ALTER TABLE has finished, shut down postgres cleanly.
5. Lose power

When you recover, the unlogged table is not reset, because it was a clean postgres shutdown. But the part of the file that was copied after the checkpoint at step 3 was never fsync'd, so part of the file is lost. I was able to reproduce with a VM that I kill to simulate step 5.

This is the same scenario that the smgrimmedsync() call above protects from for WAL-logged relations. But we got the condition wrong: instead of worrying about the init fork, we need to call smgrimmedsync() for all *other* forks of an unlogged relation.

Fortunately the fix is trivial, see attached. I suspect we have similar problems in a few other places, though. end_heap_rewrite(), _bt_load(), and gist_indexsortbuild have a similar-looking smgrimmedsync() calls, with no consideration for unlogged relations at all. I haven't tested those, but they look wrong to me.

I'm also attaching the scripts I used to reproduce this, although they will require some manual fiddling to run.

[1] https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi

--
Heikki Linnakangas
Neon (https://neon.tech)
diff --git a/src/backend/catalog/storage.c b/src/backend/catalog/storage.c
index 2add053489..ebb80fa822 100644
--- a/src/backend/catalog/storage.c
+++ b/src/backend/catalog/storage.c
@@ -533,7 +533,7 @@ RelationCopyStorage(SMgrRelation src, SMgrRelation dst,
 	 * replay our earlier WAL entries. If we do not fsync those pages here,
 	 * they might still not be on disk when the crash occurs.
 	 */
-	if (use_wal || copying_initfork)
+	if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
 		smgrimmedsync(dst, forkNum);
 }
 

Attachment: unlogged-fsynctest.sh
Description: application/shellscript

Attachment: unlogged-verify.sh
Description: application/shellscript

Reply via email to