I read through the latest patch,
v31-0001-Support-checksum-enable-disable-in-a-running-clu.patch. Some
comments below:
On 19/01/2021 14:32, Daniel Gustafsson wrote:
+ /*
+ * Hold interrupts for the duration of xlogging to avoid the state of
data
+ * checksums changing during the processing which would later the
premise
+ * for xlogging hint bits.
+ */
Sentence sense does not make.
@@ -904,6 +916,7 @@ static void SetLatestXTime(TimestampTz xtime);
static void SetCurrentChunkStartTime(TimestampTz xtime);
static void CheckRequiredParameterValues(void);
static void XLogReportParameters(void);
+static void XlogChecksums(ChecksumType new_type);
static void checkTimeLineSwitch(XLogRecPtr lsn, TimeLineID newTLI,
TimeLineID
prevTLI);
static void LocalSetXLogInsertAllowed(void);
Spelling: make it "XLogChecksums" for consistency.
/*
* DataChecksumsNeedWrite
* Returns whether data checksums must be written or not
*
* Returns true iff data checksums are enabled or are in the process of being
* enabled. In case data checksums are currently being enabled we must write
* the checksum even though it's not verified during this stage. Interrupts
* need to be held off by the caller to ensure that the returned state is
* valid for the duration of the intended processing.
*/
bool
DataChecksumsNeedWrite(void)
{
Assert(InterruptHoldoffCount > 0);
return (LocalDataChecksumVersion == PG_DATA_CHECKSUM_VERSION ||
LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_ON_VERSION ||
LocalDataChecksumVersion ==
PG_DATA_CHECKSUM_INPROGRESS_OFF_VERSION);
}
Can you be more precise on the "duration of the intended processing"? It
means, until you have actually written out the page, or something like
that, right? The similar explanation in DataChecksumsNeedVerify() is
easier to understand. The two functions are similar, it would be good to
phrase the comments similarly, so that you can quickly see the
difference between the two.
/*
* SetDataChecksumsOnInProgress
* Sets the data checksum state to "inprogress-on" to enable
checksums
*
* In order to start the process of enabling data checksums in a running
* cluster the data_checksum_version state must be changed to "inprogress-on".
* This state requires data checksums to be written but not verified. The state
* transition is performed in a critical section in order to provide crash
* safety, and checkpoints are held off. When the emitted procsignalbarrier
* has been absorbed by all backends we know that the cluster has started to
* enable data checksums.
*/
The two "in order" are superfluous, it's more concise to say just "To
start the process ..." (I was made aware of that by Jürgen Purtz's
recent patches that removed a couple of "in order"s from the docs as
unnecessary).
It's a bit confusing to talk about the critical section and
procsignalbarrier here. Does the caller need to wait for the
procsignalbarrier? No, that's just explaining what the function does
internally. Maybe move that explanation inside the function, and say
here something like "This function blocks until all processes have
acknowledged the state change" or something like that.
/*
* SetDataChecksumsOn
* Enables data checksums cluster-wide
*
* Enabling data checksums is performed using two barriers, the first one
* sets the checksums state to "inprogress-on" (which is performed by
* SetDataChecksumsOnInProgress()) and the second one to "on" (performed here).
* During "inprogress-on", checksums are written but not verified. When all
* existing pages are guaranteed to have checksums, and all new pages will be
* initiated with checksums, the state can be changed to "on".
*/
Perhaps the explanation for how these SetDataChecksumsOn() and
SetDataChecksumsOnInProgress() functions work together should be moved
to one place. For example, explain both functions here at
SetDataChecksumsOn(), and just have a "see SetDataChecksumsOn()" in the
other function, or no comment at all if they're kept next to each other.
As it stands, you have to read both comments and piece together the the
big picture in your head. Maybe add a "see datachecksumsworker.c" here,
since there's a longer explanation of the overall mechanism there.
+/*
+ * Disables checksums for the cluster, unless already disabled.
+ *
+ * Has immediate effect - the checksums are set to off right away.
+ */
+Datum
+disable_data_checksums(PG_FUNCTION_ARGS)
+{
+ if (!superuser())
+ ereport(ERROR,
+ (errmsg("must be superuser")));
+
+ StartDatachecksumsWorkerLauncher(false, 0, 0);
+
+ PG_RETURN_BOOL(true);
+}
The documentation says "Returns <literal>false</literal> in case data
checksums are disabled already", but the function always returns true.
The enable_data_checksum() function also returns a constant 'true'; why
not make it void?
The "has immediate effect" comment seems wrong, given that it actually
launches a worker process.
+ /*
+ * If the launcher is already started, the only operation we can perform
+ * is to cancel it iff the user requested for checksums to be disabled.
+ * That doesn't however mean that all other cases yield an error, as
some
+ * might be perfectly benevolent.
+ */
This comment is a bit hard to understand. Maybe something like
"If the launcher is already started, we cannot launch a new one. But if
the user requested for checksums to be disabled, we can cancel it."
+ if (DatachecksumsWorkerShmem->launcher_started)
+ {
+ if (DatachecksumsWorkerShmem->abort)
+ {
+ ereport(NOTICE,
+ (errmsg("data checksum processing is
concurrently being aborted, please retry")));
+
+ LWLockRelease(DatachecksumsWorkerLock);
+ return;
+ }
If it's being aborted, and the user requested to disable checksum, can
we leave out the NOTICE? I guess not, because the worker process won't
check the 'abort' flag, if it's already finished processing all data pages.
+ /*
+ * If the launcher is started data checksums cannot be on or
off, but
+ * it may be in an inprogress state. Since the state transition
may
+ * not have happened yet (in case of rapidly initiated checksum
enable
+ * calls for example) we inspect the target state of the
currently
+ * running launcher.
+ */
This comment contradicts itself. If the state transition has not
happened yet, then contrary to the first sentence, data checksums *are*
currently on or off.
+ if (enable_checksums)
+ {
+ /*
+ * If we are asked to enable checksums when they are
already being
+ * enabled, there is nothing to do so exit.
+ */
+ if (DatachecksumsWorkerShmem->enable_checksums)
+ {
+ LWLockRelease(DatachecksumsWorkerLock);
+ return;
+ }
+
+ /*
+ * Disabling checksums is likely to be a very quick
operation in
+ * many cases so trying to abort it to save the
checksums would
+ * run the risk of race conditions.
+ */
+ else
+ {
+ ereport(NOTICE,
+ (errmsg("data checksums are
concurrently being disabled, please retry")));
+
+ LWLockRelease(DatachecksumsWorkerLock);
+ return;
+ }
+
+ /* This should be unreachable */
+ Assert(false);
+ }
+ else if (!enable_checksums)
+ {
+ /*
+ * Data checksums are already being disabled, exit
silently.
+ */
+ if (DataChecksumsOffInProgress())
+ {
+ LWLockRelease(DatachecksumsWorkerLock);
+ return;
+ }
+
+ DatachecksumsWorkerShmem->abort = true;
+ LWLockRelease(DatachecksumsWorkerLock);
+ return;
+ }
The Assert seems unnecessary. The "if (!enable_checkums)" also seems
unnecessary, could be just "else".
+ /*
+ * The launcher is currently not running, so we need to query the system
+ * data checksum state to determine how to proceed based on the
requested
+ * target state.
+ */
"query the system" makes me think "checking the catalogs" or similar,
but this just looks at a few variables in shared memory.
+/*
+ * ShutdownDatachecksumsWorkerIfRunning
+ * Request shutdown of the datachecksumsworker
+ *
+ * This does not turn off processing immediately, it signals the checksum
+ * process to end when done with the current block.
+ */
+void
+ShutdownDatachecksumsWorkerIfRunning(void)
+{
+ LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
+
+ /* If the launcher isn't started, there is nothing to shut down */
+ if (DatachecksumsWorkerShmem->launcher_started)
+ DatachecksumsWorkerShmem->abort = true;
+
+ LWLockRelease(DatachecksumsWorkerLock);
+}
This function is unused.
+/*
+ * launcher_cancel_handler
+ *
+ * Internal routine for reacting to SIGINT and flagging the worker to abort.
+ * The worker won't be interrupted immediately but will check for abort flag
+ * between each block in a relation.
+ */
+static void
+launcher_cancel_handler(SIGNAL_ARGS)
+{
+ LWLockAcquire(DatachecksumsWorkerLock, LW_EXCLUSIVE);
+ DatachecksumsWorkerShmem->abort = true;
+ LWLockRelease(DatachecksumsWorkerLock);
+}
Acquiring an lwlock in signal handler is not safe.
Since this is a signal handler, it might still get called after the
process has finished processing, and has already set
DatachecksumsWorkerShmem->launcher_started = false. That seems like an
unexpected state.
+/*
+ * WaitForAllTransactionsToFinish
+ * Blocks awaiting all current transactions to finish
+ *
+ * Returns when all transactions which are active at the call of the function
+ * have ended, or if the postmaster dies while waiting. If the postmaster dies
+ * the abort flag will be set to indicate that the caller of this shouldn't
+ * proceed.
+ */
The caller of this function doesn't check the 'abort' flag AFAICS. I
think you could just use use WL_EXIT_ON_PM_DEATH to error out on
postmaster death.
+ while (true)
+ {
+ int processed_databases = 0;
+
+ /*
+ * Get a list of all databases to process. This may include
databases
+ * that were created during our runtime.
+ *
+ * Since a database can be created as a copy of any other
database
+ * (which may not have existed in our last run), we have to
repeat
+ * this loop until no new databases show up in the list. Since
we wait
+ * for all pre-existing transactions finish, this way we can be
+ * certain that there are no databases left without checksums.
+ */
+ DatabaseList = BuildDatabaseList();
I think it's unnecessary to wait out the transactions on the first
iteration of this loop. There must be at least one database, so there
will always be at least two iterations.
I think it would be more clear to move the
WaitForAllTransactionsToFinish() from BuildDatabaseList() to the callers.
@@ -3567,6 +3571,27 @@ RelationBuildLocalRelation(const char *relname,
relkind == RELKIND_MATVIEW)
RelationInitTableAccessMethod(rel);
+ /*
+ * Set the data checksum state. Since the data checksum state can
change at
+ * any time, the fetched value might be out of date by the time the
+ * relation is built. DataChecksumsNeedWrite returns true when data
+ * checksums are: enabled; are in the process of being enabled (state:
+ * "inprogress-on"); are in the process of being disabled (state:
+ * "inprogress-off"). Since relhaschecksums is only used to track
progress
+ * when data checksums are being enabled, and going from disabled to
+ * enabled will clear relhaschecksums before starting, it is safe to use
+ * this value for a concurrent state transition to off.
+ *
+ * If DataChecksumsNeedWrite returns false, and is concurrently changed
to
+ * true then that implies that checksums are being enabled. Worst case,
+ * this will lead to the relation being processed for checksums even
though
+ * each page written will have them already. Performing this last
shortens
+ * the window, but doesn't avoid it.
+ */
+ HOLD_INTERRUPTS();
+ rel->rd_rel->relhaschecksums = DataChecksumsNeedWrite();
+ RESUME_INTERRUPTS();
+
/*
* Okay to insert into the relcache hash table.
*
I grepped for relhashcheckums, and concluded that the value in the
relcache isn't actually used for anything. Not so! In
heap_create_with_catalog(), the actual pg_class row is constructed from
the relcache entry, so the value set in RelationBuildLocalRelation()
finds its way to pg_class. Perhaps it would be more clear to pass
relhachecksums directly as an argument to AddNewRelationTuple(). That
way, the value in the relcache would be truly never used.
- Heikki