On 10.12.2012 03:52, Kyotaro HORIGUCHI wrote:
I think that minRecoveryPoint should be updated before the data-file
changes, so XLogFlush() should be called before smgrtruncate(). No?

Mmm.. As far as I saw in xact_redo_commit_internal, XLogFlush
seems should be done AFTER redo substantial operation. Is it
wrong? If so, I suppose xact_redo_commit_internal could shold be
fixed in the same way.

So, two options:

1. Redo truncation, then XLogFlush()

There's a window where the original problem could still occur: The file is truncated, and you crash before XLogFlush() finishes.

2. XLogFlush(), then redo truncation.

If the truncation fails for some reason (disk failure?) and you have already updated minRecoveryPoint, you can not start up until you somehow fix the problem so that the truncation can succeed, and then finish recovery.

Both options have their merits. The window in 1. is very small, and in 2., the probability that an unlink() or truncation fails is very small as well.

We've chosen 1. in xact_redo_commit_internal(), but I don't think that was a carefully made decision, it just imitates what happens in the corresponding non-redo commit path. In xact_redo_commit_internal(), it makes sense to do XLogFlush() afterwards, for CREATE DATABASE and CREATE TABLESPACE. Those create files, and if you e.g run out of disk space, or non-existent path, you don't want minRecoveryPoint to be updated yet. Otherwise you can no longer recover to the point just before the creation of those files. But in case of DROP DATABASE, you have the exact same situation as with truncation: if you have already deleted some files, you must not be able to stop recovery at a point before that.

So I'd say we should update minRecoveryPoint first, then truncate/delete. But we should still keep the XLogFlush() at the end of xact_redo_commit_internal(), for the case where files/directories are created. Patch attached.

- Heikki
*** a/src/backend/access/transam/xact.c
--- b/src/backend/access/transam/xact.c
***************
*** 4617,4639 **** xact_redo_commit_internal(TransactionId xid, XLogRecPtr lsn,
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
! 	for (i = 0; i < nrels; i++)
  	{
! 		SMgrRelation srel = smgropen(xnodes[i], InvalidBackendId);
! 		ForkNumber	fork;
  
! 		for (fork = 0; fork <= MAX_FORKNUM; fork++)
! 			XLogDropRelation(xnodes[i], fork);
! 		smgrdounlink(srel, true);
! 		smgrclose(srel);
  	}
  
  	/*
  	 * We issue an XLogFlush() for the same reason we emit ForceSyncCommit()
! 	 * in normal operation. For example, in DROP DATABASE, we delete all the
! 	 * files belonging to the database, and then commit the transaction. If we
! 	 * crash after all the files have been deleted but before the commit, you
! 	 * have an entry in pg_database without any files. To minimize the window
  	 * for that, we use ForceSyncCommit() to rush the commit record to disk as
  	 * quick as possible. We have the same window during recovery, and forcing
  	 * an XLogFlush() (which updates minRecoveryPoint during recovery) helps
--- 4617,4660 ----
  	}
  
  	/* Make sure files supposed to be dropped are dropped */
! 	if (nrels > 0)
  	{
! 		/*
! 		 * First update minimum recovery point to cover this WAL record. Once
! 		 * a relation is deleted, there's no going back. The buffer manager
! 		 * enforces the WAL-first rule for normal updates to relation files,
! 		 * so that the minimum recovery point is always updated before the
! 		 * corresponding change in the data file is flushed to disk, but we
! 		 * have to do the same here since we're bypassing the buffer manager.
! 		 *
! 		 * Doing this before the deleting the files means that if a deletion
! 		 * fails for some reason, you cannot start up the system even after
! 		 * restart, until you fix the underlying situation so that the
! 		 * deletion will succeed. Alternatively, we could update the minimum
! 		 * recovery point after deletion, but that would leave a small window
! 		 * where the WAL-first rule would be violated.
! 		 */
! 		XLogFlush(lsn);
  
! 		for (i = 0; i < nrels; i++)
! 		{
! 			SMgrRelation srel = smgropen(xnodes[i], InvalidBackendId);
! 			ForkNumber	fork;
! 
! 			for (fork = 0; fork <= MAX_FORKNUM; fork++)
! 				XLogDropRelation(xnodes[i], fork);
! 			smgrdounlink(srel, true);
! 			smgrclose(srel);
! 		}
  	}
  
  	/*
  	 * We issue an XLogFlush() for the same reason we emit ForceSyncCommit()
! 	 * in normal operation. For example, in CREATE DATABASE, we copy all the
! 	 * files from the template database, and then commit the transaction. If we
! 	 * crash after all the files have been copied but before the commit, you
! 	 * have files in the data directory without an entry in pg_database.
! 	 * To minimize the window
  	 * for that, we use ForceSyncCommit() to rush the commit record to disk as
  	 * quick as possible. We have the same window during recovery, and forcing
  	 * an XLogFlush() (which updates minRecoveryPoint during recovery) helps
*** a/src/backend/catalog/storage.c
--- b/src/backend/catalog/storage.c
***************
*** 482,487 **** smgr_redo(XLogRecPtr lsn, XLogRecord *record)
--- 482,504 ----
  		 */
  		smgrcreate(reln, MAIN_FORKNUM, true);
  
+ 		/*
+ 		 * Before we perform the truncation, update minimum recovery point
+ 		 * to cover this WAL record. Once the relation is truncated, there's
+ 		 * no going back. The buffer manager enforces the WAL-first rule
+ 		 * for normal updates to relation files, so that the minimum recovery
+ 		 * point is always updated before the corresponding change in the
+ 		 * data file is flushed to disk.
+ 		 *
+ 		 * Doing this before the truncation means that if the truncation fails
+ 		 * for some reason, you cannot start up the system even after restart,
+ 		 * until you fix the underlying situation so that the truncation will
+ 		 * succeed. Alternatively, we could update the minimum recovery point
+ 		 * after truncation, but that would leave a small window where the
+ 		 * WAL-first rule could be violated.
+ 		 */
+ 		XLogFlush(lsn);
+ 
  		smgrtruncate(reln, MAIN_FORKNUM, xlrec->blkno);
  
  		/* Also tell xlogutils.c about it */
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to