Re: [HACKERS] WAL logging problem in 9.4.3?

Kyotaro HORIGUCHI Thu, 23 May 2019 00:12:07 -0700

Attached is a new version.

At Tue, 21 May 2019 21:29:48 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
<horiguchi.kyot...@lab.ntt.co.jp> wrote in 
<20190521.212948.34357392.horiguchi.kyot...@lab.ntt.co.jp>


> At Mon, 20 May 2019 15:54:30 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI 
> <horiguchi.kyot...@lab.ntt.co.jp> wrote in 
> <20190520.155430.215084510.horiguchi.kyot...@lab.ntt.co.jp>
> > > I suspect the design in the https://postgr.es/m/559fa0ba.3080...@iki.fi last
> > > paragraph will be simpler, not more complex.  In the implementation I'm
> > > envisioning, smgrDoPendingDeletes() would change name, perhaps to
> > > AtEOXact_Storage().  For every relfilenode it does not delete, it would ensure
> > > durability by syncing (for large nodes) or by WAL-logging each page (for small
> > > nodes).  RelationNeedsWAL() would return false whenever the applicable
> > > relfilenode appears in pendingDeletes.  Access methods would remove their
> > > smgrimmedsync() calls, but they would otherwise not change.  Would anyone like
> > > to try implementing that?
> > 
> > Following this direction, the attached PoC works *at least for*
> > the wal_optimization TAP tests, but it does the pending flush in
> > relcache rather than in smgr. This extends the skip-WAL feature to
> > indexes and makes the old 0002 patch on nbtree useless.
> 
> This is a tidier version of the patch.
> 
> - Passes regression tests including 018_wal_optimize.pl
> 
> - Move the substantial work to table/index AMs.
> 
>   Each AM can decide whether to support WAL skip or not.
>   Currently heap and nbtree support it.
> 
> - The timing of sync is moved from AtEOXact to PreCommit. This is
>   because heap_sync() needs xact state = INPROGRESS.
> 
> - matview and cluster are broken, since swapping to a new
>   relfilenode doesn't change rd_newRelfilenodeSubid. I'll address
>   that.

cluster/matview are fixed.

An obstacle to fixing them was the unreliability of
newRelfilenodeSubid.  As mentioned in the comment on
RelationData, newRelfilenodeSubid may disappear after certain
sequences of commands.

In the attached v14, I added "rd_firstRelfilenodeSubid", which
stores the subtransaction id where the first relfilenode
replacement in the current transaction happened. It survives any
sequence of commands, including the one mentioned in CopyFrom's
comment (which this patch removes).
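
To illustrate the difference, here is a toy model (not code from the
patch) of the sequence in the removed CopyFrom comment: TRUNCATE,
SAVEPOINT, TRUNCATE, ROLLBACK TO. The ToyRel struct and the toy_*
functions are hypothetical names. The rule that a field is cleared when
it equals the aborted subtransaction id is my simplified assumption
about relcache's subtransaction cleanup, and subtransaction ids 1 and 2
stand in for the top-level transaction and the savepoint.

```c
#include <assert.h>

typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

/* Hypothetical stand-in for the two relcache hint fields. */
typedef struct ToyRel
{
	SubTransactionId newRelfilenodeSubid;	/* latest replacement */
	SubTransactionId firstRelfilenodeSubid; /* first replacement in xact */
} ToyRel;

/* A relfilenode replacement (e.g. TRUNCATE) in subtransaction "cur". */
void
toy_replace_relfilenode(ToyRel *rel, SubTransactionId cur)
{
	assert(cur != InvalidSubTransactionId);

	/* The "new" hint is overwritten on every replacement. */
	rel->newRelfilenodeSubid = cur;

	/* The "first" hint is recorded only once per transaction. */
	if (rel->firstRelfilenodeSubid == InvalidSubTransactionId)
		rel->firstRelfilenodeSubid = cur;
}

/* Assumed abort rule: forget a hint that belongs to the aborted subxact. */
void
toy_abort_subxact(ToyRel *rel, SubTransactionId aborted)
{
	if (rel->newRelfilenodeSubid == aborted)
		rel->newRelfilenodeSubid = InvalidSubTransactionId;
	if (rel->firstRelfilenodeSubid == aborted)
		rel->firstRelfilenodeSubid = InvalidSubTransactionId;
}

/* BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t; ROLLBACK TO save; */
ToyRel
toy_truncate_twice_then_rollback(void)
{
	ToyRel		t = {InvalidSubTransactionId, InvalidSubTransactionId};

	toy_replace_relfilenode(&t, 1);		/* TRUNCATE at top level */
	toy_replace_relfilenode(&t, 2);		/* TRUNCATE inside the savepoint */
	toy_abort_subxact(&t, 2);			/* ROLLBACK TO save */
	return t;
}
```

Under these assumptions, the "new" hint ends up invalid even though the
first TRUNCATE is still in effect, while the "first" hint still points
at the top-level transaction.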

With the attached patch, on relations based on table/index AMs
that support WAL-skipping, WAL-logging is eliminated if the
relation is created in the current transaction, or its relfilenode
is replaced in the current transaction. The file sync at commit is
always performed. (Only heap and btree support it.)
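
As a rough sketch of that rule (again a toy model, not the patch
itself), the decision and the commit-time sync could look like the
following. ToyRelation, toy_can_skip_wal(), toy_precommit_sync() and
toy_count_sync() are hypothetical names. The sketch covers only the
conditions stated above and leaves out anything else the real
RelationNeedsWAL() may check, such as wal_level or persistence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int SubTransactionId;
#define InvalidSubTransactionId ((SubTransactionId) 0)

typedef struct ToyRelation ToyRelation;
typedef void (*toy_at_commit_sync_fn) (ToyRelation *rel);

/* Hypothetical stand-in for the relcache fields that matter here. */
struct ToyRelation
{
	SubTransactionId createSubid;	/* set if created in this xact */
	SubTransactionId firstRelfilenodeSubid; /* set if relfilenode replaced */
	toy_at_commit_sync_fn at_commit_sync;	/* NULL: AM cannot skip WAL */
	int			sync_count;		/* how often the toy sync ran */
};

/* WAL can be skipped only for new storage on an AM that can sync it. */
bool
toy_can_skip_wal(const ToyRelation *rel)
{
	if (rel->at_commit_sync == NULL)
		return false;
	return rel->createSubid != InvalidSubTransactionId ||
		rel->firstRelfilenodeSubid != InvalidSubTransactionId;
}

/* Before commit, sync every relation whose changes skipped WAL. */
void
toy_precommit_sync(ToyRelation *rels, size_t n)
{
	assert(rels != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		if (toy_can_skip_wal(&rels[i]))
			rels[i].at_commit_sync(&rels[i]);
	}
}

/* Toy callback: a real AM would flush buffers and fsync the file. */
void
toy_count_sync(ToyRelation *rel)
{
	rel->sync_count++;
}
```

The point of the callback being optional is that an AM without it simply
keeps WAL-logging as before, so it needs no commit-time work.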

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

>From 0430cf502bc8d04f3e71cc69a748a9a035706cb6 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Thu, 11 Oct 2018 10:03:21 +0900
Subject: [PATCH 1/2] TAP test for copy-truncation optimization.

---
 src/test/recovery/t/018_wal_optimize.pl | 291 ++++++++++++++++++++++++++++++++
 1 file changed, 291 insertions(+)
 create mode 100644 src/test/recovery/t/018_wal_optimize.pl

diff --git a/src/test/recovery/t/018_wal_optimize.pl b/src/test/recovery/t/018_wal_optimize.pl
new file mode 100644
index 0000000000..4fa8be728e
--- /dev/null
+++ b/src/test/recovery/t/018_wal_optimize.pl
@@ -0,0 +1,291 @@
+# Test WAL replay for optimized TRUNCATE and COPY records
+#
+# WAL truncation is optimized in some cases with TRUNCATE and COPY queries
+# which sometimes interact badly with the other optimizations in line with
+# several setting values of wal_level, particularly when using "minimal" or
+# "replica".  The optimization may be enabled or disabled depending on the
+# scenarios dealt with here, and should never result in any type of failure
+# or data loss.
+use strict;
+use warnings;
+
+use PostgresNode;
+use TestLib;
+use Test::More tests => 24;
+
+sub check_orphan_relfilenodes
+{
+	my($node, $test_name) = @_;
+
+	my $db_oid = $node->safe_psql('postgres',
+	   "SELECT oid FROM pg_database WHERE datname = 'postgres'");
+	my $prefix = "base/$db_oid/";
+	my $filepaths_referenced = $node->safe_psql('postgres', "
+	   SELECT pg_relation_filepath(oid) FROM pg_class
+	   WHERE reltablespace = 0 and relpersistence <> 't' and
+	   pg_relation_filepath(oid) IS NOT NULL;");
+	is_deeply([sort(map { "$prefix$_" }
+					grep(/^[0-9]+$/,
+						 slurp_dir($node->data_dir . "/$prefix")))],
+			  [sort split /\n/, $filepaths_referenced],
+			  $test_name);
+	return;
+}
+
+# Wrapper routine tunable for wal_level.
+sub run_wal_optimize
+{
+	my $wal_level = shift;
+
+	# Start a primary with the requested wal_level
+	my $node = get_new_node("node_$wal_level");
+	$node->init;
+	$node->append_conf('postgresql.conf', qq(
+wal_level = $wal_level
+));
+	$node->start;
+
+	# Setup
+	my $tablespace_dir = $node->basedir . '/tablespace_other';
+	mkdir ($tablespace_dir);
+	$tablespace_dir = TestLib::real_dir($tablespace_dir);
+	$node->safe_psql('postgres',
+	   "CREATE TABLESPACE other LOCATION '$tablespace_dir';");
+
+	# Test direct truncation optimization.  No tuples
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test1 (id serial PRIMARY KEY);
+		TRUNCATE test1;
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	my $result = $node->safe_psql('postgres', "SELECT count(*) FROM test1;");
+	is($result, qq(0),
+	   "wal_level = $wal_level, optimized truncation with empty table");
+
+	# Test truncation with inserted tuples within the same transaction.
+	# Tuples inserted after the truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test2 (id serial PRIMARY KEY);
+		INSERT INTO test2 VALUES (DEFAULT);
+		TRUNCATE test2;
+		INSERT INTO test2 VALUES (DEFAULT);
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test2;");
+	is($result, qq(1),
+	   "wal_level = $wal_level, optimized truncation with inserted table");
+
+	# Data file for COPY query in follow-up tests.
+	my $basedir = $node->basedir;
+	my $copy_file = "$basedir/copy_data.txt";
+	TestLib::append_to_file($copy_file, qq(20000,30000
+20001,30001
+20002,30002));
+
+	# Test truncation with inserted tuples using COPY.  Tuples copied after the
+	# truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3;
+		COPY test3 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, optimized truncation with copied table");
+
+	# Like previous test, but rollback SET TABLESPACE in a subtransaction.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a;
+		SAVEPOINT s; ALTER TABLE test3a SET TABLESPACE other; ROLLBACK TO s;
+		COPY test3a FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+	# in different subtransaction patterns
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a2 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a2 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a2;
+		SAVEPOINT s; ALTER TABLE test3a2 SET TABLESPACE other; RELEASE s;
+		COPY test3a2 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a2;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3a3 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3a3 (id, id2) VALUES (DEFAULT, generate_series(1,3000));
+		TRUNCATE test3a3;
+		SAVEPOINT s;
+			ALTER TABLE test3a3 SET TABLESPACE other;
+			SAVEPOINT s2;
+				ALTER TABLE test3a3 SET TABLESPACE pg_default;
+			ROLLBACK TO s2;
+			SAVEPOINT s2;
+				ALTER TABLE test3a3 SET TABLESPACE pg_default;
+			RELEASE s2;
+		ROLLBACK TO s;
+		COPY test3a3 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3a3;");
+	is($result, qq(3),
+	   "wal_level = $wal_level, SET TABLESPACE in subtransaction");
+
+	# UPDATE touches two buffers; one is BufferNeedsWAL(); the other is not.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test3b (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test3b (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+		COPY test3b FROM '$copy_file' DELIMITER ',';  -- set sync_above
+		UPDATE test3b SET id2 = id2 + 1;
+		DELETE FROM test3b;
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test3b;");
+	is($result, qq(0),
+	   "wal_level = $wal_level, UPDATE of logged page extends relation");
+
+	# Test truncation with inserted tuples using both INSERT and COPY. Tuples
+	# inserted after the truncation should be seen.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test4 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test4 (id, id2) VALUES (DEFAULT, generate_series(1,10000));
+		TRUNCATE test4;
+		INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+		COPY test4 FROM '$copy_file' DELIMITER ',';
+		INSERT INTO test4 (id, id2) VALUES (DEFAULT, 10000);
+		COMMIT;");
+
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test4;");
+	is($result, qq(5),
+	   "wal_level = $wal_level, optimized truncation with inserted/copied table");
+
+	# Test consistency of COPY with INSERT for table created in the same
+	# transaction.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test5 (id serial PRIMARY KEY, id2 int);
+		INSERT INTO test5 VALUES (DEFAULT, 1);
+		COPY test5 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test5;");
+	is($result, qq(4),
+	   "wal_level = $wal_level, replay of optimized copy with inserted table");
+
+	# Test consistency of COPY that inserts more to the same table using
+	# triggers.  If the INSERTS from the trigger go to the same block data
+	# is copied to, and the INSERTs are WAL-logged, WAL replay will fail when
+	# it tries to replay the WAL record but the "before" image doesn't match,
+	# because not all changes were WAL-logged.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test6 (id serial PRIMARY KEY, id2 text);
+		CREATE FUNCTION test6_before_row_trig() RETURNS trigger
+		  LANGUAGE plpgsql as \$\$
+		  BEGIN
+		    IF new.id2 NOT LIKE 'triggered%' THEN
+		      INSERT INTO test6 VALUES (DEFAULT, 'triggered row before' || NEW.id2);
+		    END IF;
+		    RETURN NEW;
+		  END; \$\$;
+		CREATE FUNCTION test6_after_row_trig() RETURNS trigger
+		  LANGUAGE plpgsql as \$\$
+		  BEGIN
+		    IF new.id2 NOT LIKE 'triggered%' THEN
+		      INSERT INTO test6 VALUES (DEFAULT, 'triggered row after' || OLD.id2);
+		    END IF;
+		    RETURN NEW;
+		  END; \$\$;
+		CREATE TRIGGER test6_before_row_insert
+		  BEFORE INSERT ON test6
+		  FOR EACH ROW EXECUTE PROCEDURE test6_before_row_trig();
+		CREATE TRIGGER test6_after_row_insert
+		  AFTER INSERT ON test6
+		  FOR EACH ROW EXECUTE PROCEDURE test6_after_row_trig();
+		COPY test6 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test6;");
+	is($result, qq(9),
+	   "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+	# Test consistency of INSERT, COPY and TRUNCATE in same transaction block
+	# with TRUNCATE triggers.
+	$node->safe_psql('postgres', "
+		BEGIN;
+		CREATE TABLE test7 (id serial PRIMARY KEY, id2 text);
+		CREATE FUNCTION test7_before_stat_trig() RETURNS trigger
+		  LANGUAGE plpgsql as \$\$
+		  BEGIN
+		    INSERT INTO test7 VALUES (DEFAULT, 'triggered stat before');
+		    RETURN NULL;
+		  END; \$\$;
+		CREATE FUNCTION test7_after_stat_trig() RETURNS trigger
+		  LANGUAGE plpgsql as \$\$
+		  BEGIN
+		    INSERT INTO test7 VALUES (DEFAULT, 'triggered stat after');
+		    RETURN NULL;
+		  END; \$\$;
+		CREATE TRIGGER test7_before_stat_truncate
+		  BEFORE TRUNCATE ON test7
+		  FOR EACH STATEMENT EXECUTE PROCEDURE test7_before_stat_trig();
+		CREATE TRIGGER test7_after_stat_truncate
+		  AFTER TRUNCATE ON test7
+		  FOR EACH STATEMENT EXECUTE PROCEDURE test7_after_stat_trig();
+		INSERT INTO test7 VALUES (DEFAULT, 1);
+		TRUNCATE test7;
+		COPY test7 FROM '$copy_file' DELIMITER ',';
+		COMMIT;");
+	$node->stop('immediate');
+	$node->start;
+	$result = $node->safe_psql('postgres', "SELECT count(*) FROM test7;");
+	is($result, qq(4),
+	   "wal_level = $wal_level, replay of optimized copy with before trigger");
+
+	# Test redo of temp table creation.
+	$node->safe_psql('postgres', "
+		CREATE TEMP TABLE test8 (id serial PRIMARY KEY, id2 text);");
+	$node->stop('immediate');
+	$node->start;
+
+	check_orphan_relfilenodes($node, "wal_level = $wal_level, no orphan relfilenode remains");
+
+	return;
+}
+
+# Run same test suite for multiple wal_level values.
+run_wal_optimize("minimal");
+run_wal_optimize("replica");
-- 
2.16.3

>From effbb1cdc777e0612a51682dd41f0f46b7881798 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyot...@lab.ntt.co.jp>
Date: Mon, 20 May 2019 15:38:59 +0900
Subject: [PATCH 2/2] Fix WAL skipping feature

WAL-skipping operations mixed with WAL-logged operations can lead to
database corruption after a crash. This patch changes the WAL-skipping
feature so that no data modification is WAL-logged at all, and such
relations are synced at commit.
---
 src/backend/access/brin/brin.c           |   2 +
 src/backend/access/gin/ginutil.c         |   2 +
 src/backend/access/gist/gist.c           |   2 +
 src/backend/access/hash/hash.c           |   2 +
 src/backend/access/heap/heapam.c         |   8 +-
 src/backend/access/heap/heapam_handler.c |  24 ++----
 src/backend/access/heap/rewriteheap.c    |  12 +--
 src/backend/access/index/indexam.c       |  18 +++++
 src/backend/access/nbtree/nbtree.c       |  13 ++++
 src/backend/access/transam/xact.c        |   6 ++
 src/backend/commands/cluster.c           |  29 ++++++++
 src/backend/commands/copy.c              |  38 ++--------
 src/backend/commands/createas.c          |   5 +-
 src/backend/commands/matview.c           |   4 -
 src/backend/commands/tablecmds.c         |  10 +--
 src/backend/utils/cache/relcache.c       | 123 ++++++++++++++++++++++++++++++-
 src/include/access/amapi.h               |   6 ++
 src/include/access/genam.h               |   1 +
 src/include/access/heapam.h              |   1 -
 src/include/access/nbtree.h              |   1 +
 src/include/access/rewriteheap.h         |   2 +-
 src/include/access/tableam.h             |  47 ++++++------
 src/include/utils/rel.h                  |  35 ++++++++-
 src/include/utils/relcache.h             |   4 +
 24 files changed, 289 insertions(+), 106 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ae7b729edd..4b48f44949 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -125,6 +125,8 @@ brinhandler(PG_FUNCTION_ARGS)
 	amroutine->aminitparallelscan = NULL;
 	amroutine->amparallelrescan = NULL;
 
+	amroutine->amatcommitsync = NULL;
+
 	PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gin/ginutil.c b/src/backend/access/gin/ginutil.c
index cf9699ad18..f4f0eebec5 100644
--- a/src/backend/access/gin/ginutil.c
+++ b/src/backend/access/gin/ginutil.c
@@ -77,6 +77,8 @@ ginhandler(PG_FUNCTION_ARGS)
 	amroutine->aminitparallelscan = NULL;
 	amroutine->amparallelrescan = NULL;
 
+	amroutine->amatcommitsync = NULL;
+
 	PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/gist/gist.c b/src/backend/access/gist/gist.c
index 45c00aaa87..ebaf4495b8 100644
--- a/src/backend/access/gist/gist.c
+++ b/src/backend/access/gist/gist.c
@@ -99,6 +99,8 @@ gisthandler(PG_FUNCTION_ARGS)
 	amroutine->aminitparallelscan = NULL;
 	amroutine->amparallelrescan = NULL;
 
+	amroutine->amatcommitsync = NULL;
+
 	PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index e9f2c84af1..ce7ac58204 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -98,6 +98,8 @@ hashhandler(PG_FUNCTION_ARGS)
 	amroutine->aminitparallelscan = NULL;
 	amroutine->amparallelrescan = NULL;
 
+	amroutine->amatcommitsync = NULL;
+
 	PG_RETURN_POINTER(amroutine);
 }
 
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6c342635e8..642e7d0cc5 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1950,7 +1950,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	MarkBufferDirty(buffer);
 
 	/* XLOG stuff */
-	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+	if (RelationNeedsWAL(relation))
 	{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
@@ -2133,7 +2133,7 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
 
-	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+	needwal = RelationNeedsWAL(relation);
 	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
 												   HEAP_DEFAULT_FILLFACTOR);
 
@@ -8906,10 +8906,6 @@ heap2_redo(XLogReaderState *record)
 void
 heap_sync(Relation rel)
 {
-	/* non-WAL-logged tables never need fsync */
-	if (!RelationNeedsWAL(rel))
-		return;
-
 	/* main heap */
 	FlushRelationBuffers(rel);
 	/* FlushRelationBuffers will have opened rd_smgr */
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index a4a28e88ec..17126e599b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -559,15 +559,14 @@ tuple_lock_retry:
 	return result;
 }
 
+/* ------------------------------------------------------------------------
+ * WAL-skipping related routine
+ * ------------------------------------------------------------------------
+ */
 static void
-heapam_finish_bulk_insert(Relation relation, int options)
+heapam_at_commit_sync(Relation relation)
 {
-	/*
-	 * If we skipped writing WAL, then we need to sync the heap (but not
-	 * indexes since those use WAL anyway / don't go through tableam)
-	 */
-	if (options & HEAP_INSERT_SKIP_WAL)
-		heap_sync(relation);
+	heap_sync(relation);
 }
 
 
@@ -702,7 +701,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	IndexScanDesc indexScan;
 	TableScanDesc tableScan;
 	HeapScanDesc heapScan;
-	bool		use_wal;
 	bool		is_system_catalog;
 	Tuplesortstate *tuplesort;
 	TupleDesc	oldTupDesc = RelationGetDescr(OldHeap);
@@ -716,12 +714,6 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 	/* Remember if it's a system catalog */
 	is_system_catalog = IsSystemRelation(OldHeap);
 
-	/*
-	 * We need to log the copied data in WAL iff WAL archiving/streaming is
-	 * enabled AND it's a WAL-logged rel.
-	 */
-	use_wal = XLogIsNeeded() && RelationNeedsWAL(NewHeap);
-
 	/* use_wal off requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(NewHeap) == InvalidBlockNumber);
 
@@ -732,7 +724,7 @@ heapam_relation_copy_for_cluster(Relation OldHeap, Relation NewHeap,
 
 	/* Initialize the rewrite operation */
 	rwstate = begin_heap_rewrite(OldHeap, NewHeap, OldestXmin, *xid_cutoff,
-								 *multi_cutoff, use_wal);
+								 *multi_cutoff);
 
 
 	/* Set up sorting if wanted */
@@ -2626,7 +2618,7 @@ static const TableAmRoutine heapam_methods = {
 	.tuple_delete = heapam_tuple_delete,
 	.tuple_update = heapam_tuple_update,
 	.tuple_lock = heapam_tuple_lock,
-	.finish_bulk_insert = heapam_finish_bulk_insert,
+	.at_commit_sync = heapam_at_commit_sync,
 
 	.tuple_fetch_row_version = heapam_fetch_row_version,
 	.tuple_get_latest_tid = heap_get_latest_tid,
diff --git a/src/backend/access/heap/rewriteheap.c b/src/backend/access/heap/rewriteheap.c
index 131ec7b8d7..617eec582b 100644
--- a/src/backend/access/heap/rewriteheap.c
+++ b/src/backend/access/heap/rewriteheap.c
@@ -144,7 +144,6 @@ typedef struct RewriteStateData
 	Page		rs_buffer;		/* page currently being built */
 	BlockNumber rs_blockno;		/* block where page will go */
 	bool		rs_buffer_valid;	/* T if any tuples in buffer */
-	bool		rs_use_wal;		/* must we WAL-log inserts? */
 	bool		rs_logical_rewrite; /* do we need to do logical rewriting */
 	TransactionId rs_oldest_xmin;	/* oldest xmin used by caller to determine
 									 * tuple visibility */
@@ -245,8 +244,7 @@ static void logical_end_heap_rewrite(RewriteState state);
  */
 RewriteState
 begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xmin,
-				   TransactionId freeze_xid, MultiXactId cutoff_multi,
-				   bool use_wal)
+				   TransactionId freeze_xid, MultiXactId cutoff_multi)
 {
 	RewriteState state;
 	MemoryContext rw_cxt;
@@ -271,7 +269,6 @@ begin_heap_rewrite(Relation old_heap, Relation new_heap, TransactionId oldest_xm
 	/* new_heap needn't be empty, just locked */
 	state->rs_blockno = RelationGetNumberOfBlocks(new_heap);
 	state->rs_buffer_valid = false;
-	state->rs_use_wal = use_wal;
 	state->rs_oldest_xmin = oldest_xmin;
 	state->rs_freeze_xid = freeze_xid;
 	state->rs_cutoff_multi = cutoff_multi;
@@ -330,7 +327,7 @@ end_heap_rewrite(RewriteState state)
 	/* Write the last page, if any */
 	if (state->rs_buffer_valid)
 	{
-		if (state->rs_use_wal)
+		if (RelationNeedsWAL(state->rs_new_rel))
 			log_newpage(&state->rs_new_rel->rd_node,
 						MAIN_FORKNUM,
 						state->rs_blockno,
@@ -654,9 +651,6 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 	{
 		int			options = HEAP_INSERT_SKIP_FSM;
 
-		if (!state->rs_use_wal)
-			options |= HEAP_INSERT_SKIP_WAL;
-
 		/*
 		 * While rewriting the heap for VACUUM FULL / CLUSTER, make sure data
 		 * for the TOAST table are not logically decoded.  The main heap is
@@ -695,7 +689,7 @@ raw_heap_insert(RewriteState state, HeapTuple tup)
 			/* Doesn't fit, so write out the existing page */
 
 			/* XLOG stuff */
-			if (state->rs_use_wal)
+			if (RelationNeedsWAL(state->rs_new_rel))
 				log_newpage(&state->rs_new_rel->rd_node,
 							MAIN_FORKNUM,
 							state->rs_blockno,
diff --git a/src/backend/access/index/indexam.c b/src/backend/access/index/indexam.c
index aefdd2916d..ade721a383 100644
--- a/src/backend/access/index/indexam.c
+++ b/src/backend/access/index/indexam.c
@@ -33,6 +33,7 @@
  *		index_can_return	- does index support index-only scans?
  *		index_getprocid - get a support procedure OID
  *		index_getprocinfo - get a support procedure's lookup info
+ *		index_at_commit_sync - perform at_commit_sync
  *
  * NOTES
  *		This file contains the index_ routines which used
@@ -837,6 +838,23 @@ index_getprocinfo(Relation irel,
 	return locinfo;
 }
 
+/* ----------------
+ *		index_at_commit_sync
+ *
+ *  An index AM that defines this interface can allow derived objects created
+ *  in the current transaction to skip WAL-logging. This routine is called at
+ *  commit time, and the AM must flush buffers and sync the underlying storage.
+ *
+ *  Optional interface
+ *  ----------------
+ */
+void
+index_at_commit_sync(Relation irel)
+{
+	if (irel->rd_indam && irel->rd_indam->amatcommitsync)
+		irel->rd_indam->amatcommitsync(irel);
+}
+
 /* ----------------
  *		index_store_float8_orderby_distances
  *
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 85e54ac44b..695b058b85 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -147,6 +147,8 @@ bthandler(PG_FUNCTION_ARGS)
 	amroutine->aminitparallelscan = btinitparallelscan;
 	amroutine->amparallelrescan = btparallelrescan;
 
+	amroutine->amatcommitsync = btatcommitsync;
+
 	PG_RETURN_POINTER(amroutine);
 }
 
@@ -1385,3 +1387,14 @@ btcanreturn(Relation index, int attno)
 {
 	return true;
 }
+
+/*
+ *	btatcommitsync() -- Perform at-commit sync of WAL-skipped index
+ */
+void
+btatcommitsync(Relation index)
+{
+	FlushRelationBuffers(index);
+	smgrimmedsync(index->rd_smgr, MAIN_FORKNUM);
+}
+
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f1108ccc8b..0670985bc2 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2120,6 +2120,9 @@ CommitTransaction(void)
 	if (!is_parallel_worker)
 		PreCommit_CheckForSerializationFailure();
 
+	/* Sync WAL-skipped relations */
+	PreCommit_RelationSync();
+
 	/*
 	 * Insert notifications sent by NOTIFY commands into the queue.  This
 	 * should be late in the pre-commit sequence to minimize time spent
@@ -2395,6 +2398,9 @@ PrepareTransaction(void)
 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 				 errmsg("cannot PREPARE a transaction that has manipulated logical replication workers")));
 
+	/* Sync WAL-skipped relations */
+	PreCommit_RelationSync();
+
 	/* Prevent cancel/die interrupt while cleaning up */
 	HOLD_INTERRUPTS();
 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index ebaec4f8dd..504a04104f 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -1034,12 +1034,41 @@ swap_relation_files(Oid r1, Oid r2, bool target_is_pg_class,
 
 	if (OidIsValid(relfilenode1) && OidIsValid(relfilenode2))
 	{
+		Relation rel1;
+		Relation rel2;
+
 		/*
 		 * Normal non-mapped relations: swap relfilenodes, reltablespaces,
 		 * relpersistence
 		 */
 		Assert(!target_is_pg_class);
 
+		/* Update creation subid hints of relcache */
+		rel1 = relation_open(r1, ExclusiveLock);
+		rel2 = relation_open(r2, ExclusiveLock);
+
+		/*
+		 * The new relation's relfilenode was created in the current
+		 * transaction and becomes the old relation's new relfilenode, so set
+		 * the old relation's newRelfilenodeSubid to the new relation's
+		 * createSubid.  We don't fix rel2 since it will be deleted soon.
+		 */
+		Assert(rel2->rd_createSubid != InvalidSubTransactionId);
+		rel1->rd_newRelfilenodeSubid = rel2->rd_createSubid;
+
+		/* record the first relfilenode change in the current transaction */
+		if (rel1->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+		{
+			rel1->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
+			/* Flag the old relation as needing eoxact cleanup */
+			RelationEOXactListAdd(rel1);
+		}
+
+		relation_close(rel1, ExclusiveLock);
+		relation_close(rel2, ExclusiveLock);
+
+		/* swap relfilenodes, reltablespaces, relpersistence */
 		swaptemp = relform1->relfilenode;
 		relform1->relfilenode = relform2->relfilenode;
 		relform2->relfilenode = swaptemp;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b00891ffd2..77608c09c3 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2720,28 +2720,9 @@ CopyFrom(CopyState cstate)
 	 * If it does commit, we'll have done the table_finish_bulk_insert() at
 	 * the bottom of this routine first.
 	 *
-	 * As mentioned in comments in utils/rel.h, the in-same-transaction test
-	 * is not always set correctly, since in rare cases rd_newRelfilenodeSubid
-	 * can be cleared before the end of the transaction. The exact case is
-	 * when a relation sets a new relfilenode twice in same transaction, yet
-	 * the second one fails in an aborted subtransaction, e.g.
-	 *
-	 * BEGIN;
-	 * TRUNCATE t;
-	 * SAVEPOINT save;
-	 * TRUNCATE t;
-	 * ROLLBACK TO save;
-	 * COPY ...
-	 *
-	 * Also, if the target file is new-in-transaction, we assume that checking
-	 * FSM for free space is a waste of time, even if we must use WAL because
-	 * of archiving.  This could possibly be wrong, but it's unlikely.
-	 *
-	 * The comments for table_insert and RelationGetBufferForTuple specify that
-	 * skipping WAL logging is only safe if we ensure that our tuples do not
-	 * go into pages containing tuples from any other transactions --- but this
-	 * must be the case if we have a new table or new relfilenode, so we need
-	 * no additional work to enforce that.
+	 * If the target file is new-in-transaction, we assume that checking FSM
+	 * for free space is a waste of time, even if we must use WAL because of
+	 * archiving.  This could possibly be wrong, but it's unlikely.
 	 *
 	 * We currently don't support this optimization if the COPY target is a
 	 * partitioned table as we currently only lazily initialize partition
@@ -2757,15 +2738,14 @@ CopyFrom(CopyState cstate)
 	 * are not supported as per the description above.
 	 *----------
 	 */
-	/* createSubid is creation check, newRelfilenodeSubid is truncation check */
+	/*
+	 * createSubid is creation check, firstRelfilenodeSubid is truncation and
+	 * cluster check. Partitioned tables don't have storage.
+	 */
 	if (RELKIND_HAS_STORAGE(cstate->rel->rd_rel->relkind) &&
 		(cstate->rel->rd_createSubid != InvalidSubTransactionId ||
-		 cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
-	{
+		 cstate->rel->rd_firstRelfilenodeSubid != InvalidSubTransactionId))
 		ti_options |= TABLE_INSERT_SKIP_FSM;
-		if (!XLogIsNeeded())
-			ti_options |= TABLE_INSERT_SKIP_WAL;
-	}
 
 	/*
 	 * Optimize if new relfilenode was created in this subxact or one of its
@@ -3364,8 +3344,6 @@ CopyFrom(CopyState cstate)
 
 	FreeExecutorState(estate);
 
-	table_finish_bulk_insert(cstate->rel, ti_options);
-
 	return processed;
 }
 
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 43c2fa9124..859b869b0d 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -558,8 +558,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	 * We can skip WAL-logging the insertions, unless PITR or streaming
 	 * replication is in use. We can skip the FSM in any case.
 	 */
-	myState->ti_options = TABLE_INSERT_SKIP_FSM |
-		(XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
+	myState->ti_options = TABLE_INSERT_SKIP_FSM;
 	myState->bistate = GetBulkInsertState();
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
@@ -604,8 +603,6 @@ intorel_shutdown(DestReceiver *self)
 
 	FreeBulkInsertState(myState->bistate);
 
-	table_finish_bulk_insert(myState->rel, myState->ti_options);
-
 	/* close rel, but keep lock until commit */
 	table_close(myState->rel, NoLock);
 	myState->rel = NULL;
diff --git a/src/backend/commands/matview.c b/src/backend/commands/matview.c
index dc2940cd4e..583c542121 100644
--- a/src/backend/commands/matview.c
+++ b/src/backend/commands/matview.c
@@ -463,8 +463,6 @@ transientrel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	 * replication is in use. We can skip the FSM in any case.
 	 */
 	myState->ti_options = TABLE_INSERT_SKIP_FSM | TABLE_INSERT_FROZEN;
-	if (!XLogIsNeeded())
-		myState->ti_options |= TABLE_INSERT_SKIP_WAL;
 	myState->bistate = GetBulkInsertState();
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
@@ -509,8 +507,6 @@ transientrel_shutdown(DestReceiver *self)
 
 	FreeBulkInsertState(myState->bistate);
 
-	table_finish_bulk_insert(myState->transientrel, myState->ti_options);
-
 	/* close transientrel, but keep lock until commit */
 	table_close(myState->transientrel, NoLock);
 	myState->transientrel = NULL;
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 602a8dbd1c..f63662f4ed 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -4733,9 +4733,9 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 
 	/*
 	 * Prepare a BulkInsertState and options for table_insert. Because we're
-	 * building a new heap, we can skip WAL-logging and fsync it to disk at
-	 * the end instead (unless WAL-logging is required for archiving or
-	 * streaming replication). The FSM is empty too, so don't bother using it.
+	 * building a new heap, the underlying table AM can skip WAL-logging and
+	 * fsync the relation to disk at the end of the current transaction
+	 * instead. The FSM is empty too, so don't bother using it.
 	 */
 	if (newrel)
 	{
@@ -4743,8 +4743,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 		bistate = GetBulkInsertState();
 
 		ti_options = TABLE_INSERT_SKIP_FSM;
-		if (!XLogIsNeeded())
-			ti_options |= TABLE_INSERT_SKIP_WAL;
 	}
 	else
 	{
@@ -5028,8 +5026,6 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap, LOCKMODE lockmode)
 	{
 		FreeBulkInsertState(bistate);
 
-		table_finish_bulk_insert(newrel, ti_options);
-
 		table_close(newrel, NoLock);
 	}
 }
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 2b992d7832..cd418c5f80 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -177,6 +177,13 @@ static bool eoxact_list_overflowed = false;
 			eoxact_list_overflowed = true; \
 	} while (0)
 
+/* Function version of the macro above */
+void
+RelationEOXactListAdd(Relation rel)
+{
+	EOXactListAdd(rel);
+}
+
 /*
  * EOXactTupleDescArray stores TupleDescs that (might) need AtEOXact
  * cleanup work.  The array expands as needed; there is no hashtable because
@@ -263,6 +270,7 @@ static void RelationReloadIndexInfo(Relation relation);
 static void RelationReloadNailed(Relation relation);
 static void RelationFlushRelation(Relation relation);
 static void RememberToFreeTupleDescAtEOX(TupleDesc td);
+static void PreCommit_SyncOneRelation(Relation relation);
 static void AtEOXact_cleanup(Relation relation, bool isCommit);
 static void AtEOSubXact_cleanup(Relation relation, bool isCommit,
 								SubTransactionId mySubid, SubTransactionId parentSubid);
@@ -1512,6 +1520,10 @@ RelationInitIndexAccessInfo(Relation relation)
 	relation->rd_exclprocs = NULL;
 	relation->rd_exclstrats = NULL;
 	relation->rd_amcache = NULL;
+
+	/* Set the AM-type-independent WAL-skip flag if this AM supports it */
+	if (relation->rd_indam->amatcommitsync != NULL)
+		relation->rd_can_skipwal = true;
 }
 
 /*
@@ -1781,6 +1793,10 @@ RelationInitTableAccessMethod(Relation relation)
 	 * Now we can fetch the table AM's API struct
 	 */
 	InitTableAmRoutine(relation);
+
+	/* Set the AM-type-independent WAL-skip flag if this AM supports it */
+	if (relation->rd_tableam && relation->rd_tableam->at_commit_sync)
+		relation->rd_can_skipwal = true;
 }
 
 /*
@@ -2594,6 +2610,7 @@ RelationClearRelation(Relation relation, bool rebuild)
 		/* creation sub-XIDs must be preserved */
 		SWAPFIELD(SubTransactionId, rd_createSubid);
 		SWAPFIELD(SubTransactionId, rd_newRelfilenodeSubid);
+		SWAPFIELD(SubTransactionId, rd_firstRelfilenodeSubid);
 		/* un-swap rd_rel pointers, swap contents instead */
 		SWAPFIELD(Form_pg_class, rd_rel);
 		/* ... but actually, we don't have to update newrel->rd_rel */
@@ -2661,7 +2678,7 @@ static void
 RelationFlushRelation(Relation relation)
 {
 	if (relation->rd_createSubid != InvalidSubTransactionId ||
-		relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+		relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
 	{
 		/*
 		 * New relcache entries are always rebuilt, not flushed; else we'd
@@ -2801,7 +2818,7 @@ RelationCacheInvalidate(void)
 		 * pending invalidations.
 		 */
 		if (relation->rd_createSubid != InvalidSubTransactionId ||
-			relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
+			relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId)
 			continue;
 
 		relcacheInvalsReceived++;
@@ -2913,6 +2930,93 @@ RememberToFreeTupleDescAtEOX(TupleDesc td)
 	EOXactTupleDescArray[NextEOXactTupleDescNum++] = td;
 }
 
+/*
+ * PreCommit_RelationSync
+ *
+ *	Sync relations that were WAL-skipped in this transaction.
+ *
+ * An access method may have skipped WAL-logging for relations created in the
+ * current transaction. Such relations need to be synced at top-level
+ * transaction commit.  The sync requires an active transaction state, so it
+ * is performed separately from AtEOXact_RelationCache.
+ */
+void
+PreCommit_RelationSync(void)
+{
+	HASH_SEQ_STATUS status;
+	RelIdCacheEnt *idhentry;
+	int			i;
+
+	/* See AtEOXact_RelationCache about eoxact_list */
+	if (eoxact_list_overflowed)
+	{
+		hash_seq_init(&status, RelationIdCache);
+		while ((idhentry = (RelIdCacheEnt *) hash_seq_search(&status)) != NULL)
+			PreCommit_SyncOneRelation(idhentry->reldesc);
+	}
+	else
+	{
+		for (i = 0; i < eoxact_list_len; i++)
+		{
+			idhentry = (RelIdCacheEnt *) hash_search(RelationIdCache,
+													 (void *) &eoxact_list[i],
+													 HASH_FIND,
+													 NULL);
+
+			if (idhentry != NULL)
+				PreCommit_SyncOneRelation(idhentry->reldesc);
+		}
+	}
+}
+
+/*
+ * PreCommit_SyncOneRelation
+ *
+ *	Sync one relation if needed
+ *
+ * NB: this processing must be idempotent, because EOXactListAdd() doesn't
+ * bother to prevent duplicate entries in eoxact_list[].
+ */
+static void
+PreCommit_SyncOneRelation(Relation relation)
+{
+	HeapTuple reltup;
+	Form_pg_class relform;
+
+	/* return immediately if no need for sync */
+	if (!RelationNeedsAtCommitSync(relation))
+		return;
+
+	/*
+	 * We are about to sync a WAL-skipped relation. The cached relfilenode is
+	 * stale if the last subtransaction that assigned a new relfilenode was
+	 * aborted, so reload it from pg_class.
+	 */
+	if (relation->rd_firstRelfilenodeSubid != InvalidSubTransactionId &&
+		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
+	{
+		reltup = SearchSysCache1(RELOID, ObjectIdGetDatum(relation->rd_id));
+		if (!HeapTupleIsValid(reltup))
+			elog(ERROR, "cache lookup failed for relation %u", relation->rd_id);
+		relform = (Form_pg_class) GETSTRUCT(reltup);
+		relation->rd_rel->relfilenode = relform->relfilenode;
+		relation->rd_node.relNode = relform->relfilenode;
+		ReleaseSysCache(reltup);
+	}
+
+	if (relation->rd_tableam != NULL)
+		table_at_commit_sync(relation);
+	else
+	{
+		Assert(relation->rd_indam != NULL);
+		index_at_commit_sync(relation);
+	}
+
+	/* We have synced the files, forget about relfilenode change */
+	relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+	relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+}
+
 /*
  * AtEOXact_RelationCache
  *
@@ -3058,6 +3162,7 @@ AtEOXact_cleanup(Relation relation, bool isCommit)
 	 * Likewise, reset the hint about the relfilenode being new.
 	 */
 	relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+	relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
 }
 
 /*
@@ -3149,7 +3254,7 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
 	}
 
 	/*
-	 * Likewise, update or drop any new-relfilenode-in-subtransaction hint.
+	 * Likewise, update or drop any new-relfilenode-in-subtransaction hints.
 	 */
 	if (relation->rd_newRelfilenodeSubid == mySubid)
 	{
@@ -3158,6 +3263,14 @@ AtEOSubXact_cleanup(Relation relation, bool isCommit,
 		else
 			relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
 	}
+
+	if (relation->rd_firstRelfilenodeSubid == mySubid)
+	{
+		if (isCommit)
+			relation->rd_firstRelfilenodeSubid = parentSubid;
+		else
+			relation->rd_firstRelfilenodeSubid = InvalidSubTransactionId;
+	}
 }
 
 
@@ -3440,6 +3553,10 @@ RelationSetNewRelfilenode(Relation relation, char persistence)
 	 */
 	RelationDropStorage(relation);
 
+	/* Record the subxid where the first relfilenode change happened */
+	if (relation->rd_firstRelfilenodeSubid == InvalidSubTransactionId)
+		relation->rd_firstRelfilenodeSubid = GetCurrentSubTransactionId();
+
 	/*
 	 * Create storage for the main fork of the new relfilenode.  If it's a
 	 * table-like object, call into the table AM to do so, which'll also
diff --git a/src/include/access/amapi.h b/src/include/access/amapi.h
index 6e3db06eed..75159d10d4 100644
--- a/src/include/access/amapi.h
+++ b/src/include/access/amapi.h
@@ -156,6 +156,9 @@ typedef void (*aminitparallelscan_function) (void *target);
 /* (re)start parallel index scan */
 typedef void (*amparallelrescan_function) (IndexScanDesc scan);
 
+/* sync relation at commit after skipping WAL-logging */
+typedef void (*amatcommitsync_function) (Relation indexRelation);
+
 /*
  * API struct for an index AM.  Note this must be stored in a single palloc'd
  * chunk of memory.
@@ -230,6 +233,9 @@ typedef struct IndexAmRoutine
 	amestimateparallelscan_function amestimateparallelscan; /* can be NULL */
 	aminitparallelscan_function aminitparallelscan; /* can be NULL */
 	amparallelrescan_function amparallelrescan; /* can be NULL */
+
+	/* interface function to do at-commit sync after skipping WAL-logging */
+	amatcommitsync_function amatcommitsync; /* can be NULL */
 } IndexAmRoutine;
 
 
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 8c053be2ca..8e661edfdd 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -177,6 +177,7 @@ extern RegProcedure index_getprocid(Relation irel, AttrNumber attnum,
 									uint16 procnum);
 extern FmgrInfo *index_getprocinfo(Relation irel, AttrNumber attnum,
 								   uint16 procnum);
+extern void index_at_commit_sync(Relation irel);
 extern void index_store_float8_orderby_distances(IndexScanDesc scan,
 												 Oid *orderByTypes, double *distances,
 												 bool recheckOrderBy);
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index b88bd8a4d7..187c668878 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -29,7 +29,6 @@
 
 
 /* "options" flag bits for heap_insert */
-#define HEAP_INSERT_SKIP_WAL	TABLE_INSERT_SKIP_WAL
 #define HEAP_INSERT_SKIP_FSM	TABLE_INSERT_SKIP_FSM
 #define HEAP_INSERT_FROZEN		TABLE_INSERT_FROZEN
 #define HEAP_INSERT_NO_LOGICAL	TABLE_INSERT_NO_LOGICAL
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index a3583f225b..f33d2b38b5 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -717,6 +717,7 @@ extern IndexBulkDeleteResult *btbulkdelete(IndexVacuumInfo *info,
 extern IndexBulkDeleteResult *btvacuumcleanup(IndexVacuumInfo *info,
 											  IndexBulkDeleteResult *stats);
 extern bool btcanreturn(Relation index, int attno);
+extern void btatcommitsync(Relation index);
 
 /*
  * prototypes for internal functions in nbtree.c
diff --git a/src/include/access/rewriteheap.h b/src/include/access/rewriteheap.h
index 8056253916..7f9736e294 100644
--- a/src/include/access/rewriteheap.h
+++ b/src/include/access/rewriteheap.h
@@ -23,7 +23,7 @@ typedef struct RewriteStateData *RewriteState;
 
 extern RewriteState begin_heap_rewrite(Relation OldHeap, Relation NewHeap,
 									   TransactionId OldestXmin, TransactionId FreezeXid,
-									   MultiXactId MultiXactCutoff, bool use_wal);
+									   MultiXactId MultiXactCutoff);
 extern void end_heap_rewrite(RewriteState state);
 extern void rewrite_heap_tuple(RewriteState state, HeapTuple oldTuple,
 							   HeapTuple newTuple);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 6f1cd382d8..759a1e806d 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -409,19 +409,15 @@ typedef struct TableAmRoutine
 							   TM_FailureData *tmfd);
 
 	/*
-	 * Perform operations necessary to complete insertions made via
-	 * tuple_insert and multi_insert with a BulkInsertState specified. This
-	 * may for example be used to flush the relation, when the
-	 * TABLE_INSERT_SKIP_WAL option was used.
+	 * Sync relation at commit-time after skipping WAL-logging.
 	 *
-	 * Typically callers of tuple_insert and multi_insert will just pass all
-	 * the flags that apply to them, and each AM has to decide which of them
-	 * make sense for it, and then only take actions in finish_bulk_insert for
-	 * those flags, and ignore others.
+	 * A table AM may skip WAL-logging for relations created in the current
+	 * transaction. This routine is called at commit time, and the table AM
+	 * must flush buffers and sync the underlying storage.
 	 *
 	 * Optional callback.
 	 */
-	void		(*finish_bulk_insert) (Relation rel, int options);
+	void		(*at_commit_sync) (Relation rel);
 
 
 	/* ------------------------------------------------------------------------
@@ -1089,10 +1085,6 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * behaviour of the AM. Several options might be ignored by AMs not supporting
  * them.
  *
- * If the TABLE_INSERT_SKIP_WAL option is specified, the new tuple doesn't
- * need to be logged to WAL, even for a non-temp relation. It is the AMs
- * choice whether this optimization is supported.
- *
  * If the TABLE_INSERT_SKIP_FSM option is specified, AMs are free to not reuse
  * free space in the relation. This can save some cycles when we know the
  * relation is new and doesn't contain useful amounts of free space.  It's
@@ -1112,10 +1104,12 @@ table_compute_xid_horizon_for_tuples(Relation rel,
  * Note that most of these options will be applied when inserting into the
  * heap's TOAST table, too, if the tuple requires any out-of-line data.
  *
+ * The core macro RelationNeedsWAL() allows skipping WAL-logging for
+ * relations created or truncated in the current transaction when the AM
+ * provides the at_commit_sync interface.
  *
  * The BulkInsertState object (if any; bistate can be NULL for default
- * behavior) is also just passed through to RelationGetBufferForTuple. If
- * `bistate` is provided, table_finish_bulk_insert() needs to be called.
+ * behavior) is also just passed through to RelationGetBufferForTuple.
  *
  * On return the slot's tts_tid and tts_tableOid are updated to reflect the
  * insertion. But note that any toasting of fields within the slot is NOT
@@ -1205,6 +1199,8 @@ table_multi_insert(Relation rel, TupleTableSlot **slots, int nslots,
  * delete it.  Failure return codes are TM_SelfModified, TM_Updated, and
  * TM_BeingModified (the last only possible if wait == false).
  *
+ * See table_insert about the WAL-skipping feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, and, if possible, t_cmax.  See comments for
  * struct TM_FailureData for additional info.
@@ -1249,6 +1245,8 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
  * update was done.  However, any TOAST changes in the new tuple's
  * data are not reflected into *newtup.
  *
+ * See table_insert about the WAL-skipping feature.
+ *
  * In the failure cases, the routine fills *tmfd with the tuple's t_ctid,
  * t_xmax, and, if possible, t_cmax.  See comments for struct TM_FailureData
  * for additional info.
@@ -1310,20 +1308,23 @@ table_lock_tuple(Relation rel, ItemPointer tid, Snapshot snapshot,
 }
 
 /*
- * Perform operations necessary to complete insertions made via
- * tuple_insert and multi_insert with a BulkInsertState specified. This
- * e.g. may e.g. used to flush the relation when inserting with
- * TABLE_INSERT_SKIP_WAL specified.
+ * Sync relation at commit-time if needed.
+ *
+ * A table AM that defines this interface can allow derived objects created
+ * in the current transaction to skip WAL-logging. This routine is called at
+ * commit time, and the table AM must flush buffers and sync the underlying
+ * storage.
+ *
+ * Optional callback.
  */
 static inline void
-table_finish_bulk_insert(Relation rel, int options)
+table_at_commit_sync(Relation rel)
 {
 	/* optional callback */
-	if (rel->rd_tableam && rel->rd_tableam->finish_bulk_insert)
-		rel->rd_tableam->finish_bulk_insert(rel, options);
+	if (rel->rd_tableam && rel->rd_tableam->at_commit_sync)
+		rel->rd_tableam->at_commit_sync(rel);
 }
 
-
 /* ------------------------------------------------------------------------
  * DDL related functionality.
  * ------------------------------------------------------------------------
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index d7f33abce3..6a3ef80575 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -63,6 +63,7 @@ typedef struct RelationData
 	bool		rd_indexvalid;	/* is rd_indexlist valid? (also rd_pkindex and
 								 * rd_replidindex) */
 	bool		rd_statvalid;	/* is rd_statlist valid? */
+	bool		rd_can_skipwal; /* does underlying AM allow skipping WAL? */
 
 	/*
 	 * rd_createSubid is the ID of the highest subtransaction the rel has
@@ -76,10 +77,17 @@ typedef struct RelationData
 	 * transaction, with one of them occurring in a subsequently aborted
 	 * subtransaction, e.g. BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t;
 	 * ROLLBACK TO save; -- rd_newRelfilenode is now forgotten
+	 * rd_firstRelfilenodeSubid is the ID of the highest subtransaction that
+	 * the first relfilenode change in the current transaction has survived
+	 * into. Unlike rd_newRelfilenodeSubid, it is not forgotten in the case
+	 * above. A valid value means that the currently active relfilenode is
+	 * transaction-local and needs no WAL-logging.
 	 */
 	SubTransactionId rd_createSubid;	/* rel was created in current xact */
 	SubTransactionId rd_newRelfilenodeSubid;	/* new relfilenode assigned in
 												 * current xact */
+	SubTransactionId rd_firstRelfilenodeSubid;	/* new relfilenode assigned
+												 * first in current xact */
 
 	Form_pg_class rd_rel;		/* RELATION tuple */
 	TupleDesc	rd_att;			/* tuple descriptor */
@@ -512,9 +520,32 @@ typedef struct ViewOptions
 /*
  * RelationNeedsWAL
  *		True if relation needs WAL.
+ *
+ * If the underlying AM supports the WAL-skipping feature, this returns false
+ * when wal_level = minimal and the relation was created or truncated in the
+ * current transaction.
  */
-#define RelationNeedsWAL(relation) \
-	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+#define RelationNeedsWAL(relation)										\
+	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&	\
+	 (!(relation)->rd_can_skipwal ||									\
+	  XLogIsNeeded() ||													\
+	  ((relation)->rd_createSubid == InvalidSubTransactionId &&		\
+	   (relation)->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
+
+/*
+ * RelationNeedsAtCommitSync
+ *      True if relation needs at-commit sync
+ *
+ * This macro is used in only a few places, but it is defined here because it
+ * is closely related to RelationNeedsWAL() above. We don't need to sync
+ * local or temp relations.
+ */
+#define RelationNeedsAtCommitSync(relation) \
+	((relation)->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT &&	\
+	 !(!(relation)->rd_can_skipwal ||									\
+	   XLogIsNeeded() ||												\
+	   ((relation)->rd_createSubid == InvalidSubTransactionId &&		\
+		(relation)->rd_firstRelfilenodeSubid == InvalidSubTransactionId)))
 
 /*
  * RelationUsesLocalBuffers
diff --git a/src/include/utils/relcache.h b/src/include/utils/relcache.h
index d9c10ffcba..b681d3afb2 100644
--- a/src/include/utils/relcache.h
+++ b/src/include/utils/relcache.h
@@ -120,6 +120,7 @@ extern void RelationCacheInvalidate(void);
 
 extern void RelationCloseSmgrByOid(Oid relationId);
 
+extern void PreCommit_RelationSync(void);
 extern void AtEOXact_RelationCache(bool isCommit);
 extern void AtEOSubXact_RelationCache(bool isCommit, SubTransactionId mySubid,
 									  SubTransactionId parentSubid);
@@ -138,4 +139,7 @@ extern bool criticalRelcachesBuilt;
 /* should be used only by relcache.c and postinit.c */
 extern bool criticalSharedRelcachesBuilt;
 
+/* add rel to eoxact cleanup list */
+extern void RelationEOXactListAdd(Relation rel);
+
 #endif							/* RELCACHE_H */
-- 
2.16.3
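
For readers following the rd_firstRelfilenodeSubid logic, here is a small
standalone sketch of how the two subid fields are meant to behave across a
rolled-back savepoint, and how that feeds the RelationNeedsWAL() and
RelationNeedsAtCommitSync() conditions. It is only a toy model and not part
of the patch: ToyRel, SubXid, INVALID_SUBXID and the toy_* functions are
made-up stand-ins for RelationData, SubTransactionId,
InvalidSubTransactionId and the relcache code, and XLogIsNeeded() is reduced
to a plain boolean argument.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int SubXid;	/* stand-in for SubTransactionId */
#define INVALID_SUBXID 0u		/* stand-in for InvalidSubTransactionId */

/* Toy model of the RelationData fields that the patch consults. */
typedef struct ToyRel
{
	bool		permanent;		/* relpersistence is RELPERSISTENCE_PERMANENT */
	bool		can_skipwal;	/* models rd_can_skipwal */
	SubXid		create_subid;	/* models rd_createSubid */
	SubXid		new_subid;		/* models rd_newRelfilenodeSubid */
	SubXid		first_subid;	/* models rd_firstRelfilenodeSubid */
} ToyRel;

/* Mirrors the condition in the patched RelationNeedsWAL() macro. */
static bool
toy_needs_wal(const ToyRel *rel, bool xlog_is_needed)
{
	return rel->permanent &&
		(!rel->can_skipwal ||
		 xlog_is_needed ||
		 (rel->create_subid == INVALID_SUBXID &&
		  rel->first_subid == INVALID_SUBXID));
}

/* For permanent relations this is simply the negation of the above. */
static bool
toy_needs_at_commit_sync(const ToyRel *rel, bool xlog_is_needed)
{
	return rel->permanent && !toy_needs_wal(rel, xlog_is_needed);
}

/* Models the bookkeeping done when a new relfilenode is assigned. */
static void
toy_set_new_relfilenode(ToyRel *rel, SubXid cur_subid)
{
	assert(cur_subid != INVALID_SUBXID);
	rel->new_subid = cur_subid;
	/* Only the first change in the transaction is remembered here. */
	if (rel->first_subid == INVALID_SUBXID)
		rel->first_subid = cur_subid;
}

/* Models the subid handling at subtransaction commit or abort. */
static void
toy_at_eosubxact(ToyRel *rel, bool is_commit, SubXid my_subid,
				 SubXid parent_subid)
{
	if (rel->new_subid == my_subid)
		rel->new_subid = is_commit ? parent_subid : INVALID_SUBXID;
	if (rel->first_subid == my_subid)
		rel->first_subid = is_commit ? parent_subid : INVALID_SUBXID;
}
```

Replaying BEGIN; TRUNCATE t; SAVEPOINT save; TRUNCATE t; ROLLBACK TO save
against this model (subids 1 and 2) clears new_subid but leaves first_subid
at 1, so with wal_level = minimal the relation still skips WAL and is still
picked up for the at-commit sync.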
