Hi,

On Wednesday, 2019-02-06 at 11:39 -0500, Stephen Frost wrote:
> * Michael Banck (michael.ba...@credativ.de) wrote:
> > On Tuesday, 2019-02-05 at 11:30 +0100, Tomas Vondra wrote:
> > > On 2/5/19 8:01 AM, Andres Freund wrote:
> > > > On 2019-02-05 06:57:06 +0100, Fabien COELHO wrote:
> > > > > > > > I'm wondering (possibly again) about the existing early exit
> > > > > > > > if one block cannot be read on retry: the command should
> > > > > > > > count this as a kind of bad block, proceed on checking other
> > > > > > > > files, and obviously fail in the end, but having checked
> > > > > > > > everything else and generated a report. I do not think that
> > > > > > > > this condition warrants a full stop. ISTM that under rare
> > > > > > > > race conditions (eg, an unlucky concurrent "drop database" or
> > > > > > > > "drop table") this could happen when online, although I could
> > > > > > > > not trigger one despite heavy testing, so I'm possibly
> > > > > > > > mistaken.
> > > > > > > 
> > > > > > > This seems like a defensible judgement call either way.
> > > > > > 
> > > > > > Right now we have a few tests that explicitly check that
> > > > > > pg_verify_checksums fails on broken data ("foo" in the file).
> > > > > > Those would then just get skipped AFAICT, which I think is the
> > > > > > worse behaviour, but if everybody thinks that should be the way
> > > > > > to go, we can drop/adjust those tests and make
> > > > > > pg_verify_checksums skip them.
> > > > > > 
> > > > > > Thoughts?
> > > > > 
> > > > > My point is that it should fail as it does, only not immediately
> > > > > (early exit), but after having checked everything else. This means
> > > > > avoiding calling "exit(1)" here and there (lseek, fopen...), but
> > > > > taking note that something bad happened, and calling exit only at
> > > > > the end.
> > > > 
> > > > I can see both as being valuable (one gives you a more complete picture,
> > > > the other a quicker answer in scripts). For me that's the point where
> > > > it's the prerogative of the author to make that choice.
> 
> ... unless people here object or prefer other options, and then it's up
> to discussion and hopefully some consensus comes out of it.
> 
> Also, I have to say that I really don't think the 'quicker answer'
> argument holds any weight, making me question if that's a valid
> use-case.  If there *isn't* an issue, which we would likely all agree is
> the case the vast majority of the time that this is going to be run,
> then it's going to take quite a while and anyone calling it should
> expect and be prepared for that.  In the extremely rare cases, what does
> exiting early actually do for us?
> 
> > Personally, I would prefer to keep it as simple as possible for now and
> > get this patch committed; in my opinion the behaviour is already like
> > this (early exit on corrupt files) so I don't think the online
> > verification patch should change this.
> 
> I'm also in the camp of "would rather it not exit immediately, so the
> extent of the issue is clear".
> 
> > If we see complaints about this, then I'd be happy to change it
> > afterwards.
> 
> I really don't think this is something we should change later on in a
> future release..  If the consensus is that there's really two different
> but valid use-cases then we should make it configurable, but I'm not
> convinced there is.

OK, fair enough.

> > > Why not make this configurable, using a command-line option?
> > 
> > I like this even less - this tool is about verifying checksums, so
> > adding options on what to do when it encounters broken pages looks out-
> > of-scope to me.  Unless we want to say it should generally abort on the
> > first issue (i.e. on wrong checksums as well).
> 
> I definitely disagree that it's somehow 'out of scope' for this tool to
> skip broken pages, when we can tell that they're broken.  

I didn't mean that it's out-of-scope for pg_verify_checksums; I meant it
is out-of-scope for this patch, which adds online checking.

> There is a question here about how to handle a short read since that
> can happen under normal conditions if we're unlucky.  The same is also
> true for files disappearing entirely.
> 
> So, let's talk/think through a few cases:
> 
> A file with just 'foo\n' in it- could that be a page starting with
> an LSN around 666F6F0A that we somehow only read the first few bytes of?
> If not, why not?  I could possibly see an argument that we expect to
> always get at least 512 bytes in a read, or 4K, but it seems like we
> could possibly run into edge cases on odd filesystems or such.  In the
> end, I'm leaning towards categorizing different things, well,
> differently- a short read would be reported as a NOTICE or equivalent,
> perhaps, meaning that the test case needs to do something more than just
> have a file with 'foo' in it, but that is likely a good thing anyway-
> the test cases would be better if they were closer to real world.  Other
> read failures would be reported in a more serious category assuming they
> are "this really shouldn't happen" cases.  A file disappearing isn't a
> "can't happen" case, and might be reported at the same 'NOTICE' level
> (or maybe with a 'verbose' option).

In the context of this patch, we also need to decide whether a
particular case is merely a notice (or warning) on an offline cluster as
well - I guess you think it should be?

So I've changed it such that a short read emits a "warning" message,
increments a new skippedfiles counter (as it is not just a skipped
block), and reports that number at the end - should pg_verify_checksums
then exit with a non-zero status even if there were no wrong checksums?
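
To make that concrete, the short-read path in scan_file() now looks
roughly like this (simplified from the patch below; translation markers
and the exact error wording are omitted):

		if (r != BLCKSZ)
		{
			if (block_retry)
			{
				/*
				 * Second short read of the same block: warn, count the
				 * file as skipped and keep scanning the rest of it.
				 */
				skippedfiles++;
				fprintf(stderr, "warning: could not read block %u in file \"%s\": read %d of %d\n",
						blockno, fn, r, BLCKSZ);
				continue;
			}

			/* First short read: seek back and retry the block once. */
			if (lseek(f, -r, SEEK_CUR) == -1)
				exit(1);	/* the patch prints an error first */

			block_retry = true;
			blockno--;
			continue;
		}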

> A file that's 8k in size and has a checksum but it's not right seems
> pretty clear to me.  Might as well include a count of pages which have a
> valid checksum, I would think, though perhaps only in a 'verbose' mode
> would that get reported.

What's the use of that? The number of scanned blocks is already
reported at the end, so the number of blocks with a valid checksum is
easy to derive from it and the number of bad checksums.
 
> A completely zero'd page could also be reported at a NOTICE level or
> with a count, or perhaps only with verbose.

It is counted as a skipped block right now (well, every block that
qualifies for PageIsNew() is), but skipped blocks are currently not
reported. I guess the rationale is that it might lead to excessive
screen output (but then, verbose originally logged /every/ block);
you'd have to check with the original authors.

So I have now changed the behaviour so that short reads count as
skipped files and pg_verify_checksums no longer bails out on them. When
this occurs, a warning is written to stderr and the overall count is
also reported at the end. However, unless there are also blocks with
bad checksums, the exit status is kept at zero.
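
For illustration, a run that hits one unreadable file but no bad
checksums would then print a summary roughly like this (the numbers are
made up) and exit with status zero:

	Checksum scan completed
	Data checksum version: 1
	Files scanned:  1226
	Files skipped: 1
	Blocks scanned: 3011
	Blocks skipped: 27
	Bad checksums:  0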

New patch attached.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.ba...@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to
the following provisions: https://www.credativ.de/datenschutz
diff --git a/doc/src/sgml/ref/pg_verify_checksums.sgml b/doc/src/sgml/ref/pg_verify_checksums.sgml
index 905b8f1222..4ad6edcde6 100644
--- a/doc/src/sgml/ref/pg_verify_checksums.sgml
+++ b/doc/src/sgml/ref/pg_verify_checksums.sgml
@@ -37,9 +37,8 @@ PostgreSQL documentation
   <title>Description</title>
   <para>
    <command>pg_verify_checksums</command> verifies data checksums in a
-   <productname>PostgreSQL</productname> cluster.  The server must be shut
-   down cleanly before running <application>pg_verify_checksums</application>.
-   The exit status is zero if there are no checksum errors, otherwise nonzero.
+   <productname>PostgreSQL</productname> cluster.  The exit status is zero if
+   there are no checksum errors, otherwise nonzero.  
   </para>
  </refsect1>
 
diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
index 511262ab5f..801cf6750a 100644
--- a/src/bin/pg_verify_checksums/pg_verify_checksums.c
+++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
@@ -1,7 +1,7 @@
 /*
  * pg_verify_checksums
  *
- * Verifies page level checksums in an offline cluster
+ * Verifies page level checksums in a cluster
  *
  *	Copyright (c) 2010-2019, PostgreSQL Global Development Group
  *
@@ -24,12 +24,16 @@
 
 
 static int64 files = 0;
+static int64 skippedfiles = 0;
 static int64 blocks = 0;
 static int64 badblocks = 0;
+static int64 skippedblocks = 0;
 static ControlFileData *ControlFile;
+static XLogRecPtr checkpointLSN;
 
 static char *only_relfilenode = NULL;
 static bool verbose = false;
+static bool online = false;
 
 static const char *progname;
 
@@ -86,10 +90,17 @@ scan_file(const char *fn, BlockNumber segmentno)
 	PageHeader	header = (PageHeader) buf.data;
 	int			f;
 	BlockNumber blockno;
+	bool		block_retry = false;
 
 	f = open(fn, O_RDONLY | PG_BINARY, 0);
 	if (f < 0)
 	{
+		if (online && errno == ENOENT)
+		{
+			/* File was removed in the meantime */
+			return;
+		}
+
 		fprintf(stderr, _("%s: could not open file \"%s\": %s\n"),
 				progname, fn, strerror(errno));
 		exit(1);
@@ -104,26 +115,107 @@ scan_file(const char *fn, BlockNumber segmentno)
 
 		if (r == 0)
 			break;
+		if (r < 0)
+		{
+			fprintf(stderr, _("%s: could not read block %u in file \"%s\": %s\n"),
+					progname, blockno, fn, strerror(errno));
+			return;
+		}
 		if (r != BLCKSZ)
 		{
-			fprintf(stderr, _("%s: could not read block %u in file \"%s\": read %d of %d\n"),
-					progname, blockno, fn, r, BLCKSZ);
-			exit(1);
+			if (block_retry)
+			{
+				/* We already tried once to reread the block, skip it */
+				skippedfiles++;
+				fprintf(stderr, _("%s: warning: could not read block %u in file \"%s\": read %d of %d\n"),
+						progname, blockno, fn, r, BLCKSZ);
+				continue;
+			}
+
+			/*
+			 * Retry the block. It's possible that we read the block while it
+			 * was extended or shrunk, so it ends up looking torn to us.
+			 */
+
+			/*
+			 * Seek back by the amount of bytes we read to the beginning of
+			 * the failed block.
+			 */
+			if (lseek(f, -r, SEEK_CUR) == -1)
+			{
+				fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+						progname, fn);
+				exit(1);
+			}
+
+			/* Set flag so we know a retry was attempted */
+			block_retry = true;
+
+			/* Reset loop to validate the block again */
+			blockno--;
+
+			continue;
 		}
-		blocks++;
 
 		/* New pages have no checksum yet */
 		if (PageIsNew(header))
+		{
+			skippedblocks++;
 			continue;
+		}
+
+		blocks++;
 
 		csum = pg_checksum_page(buf.data, blockno + segmentno * RELSEG_SIZE);
 		if (csum != header->pd_checksum)
 		{
+			/*
+			 * Retry the block on the first failure.  If the verification is
+			 * done while the instance is online, it is possible that we read
+			 * the first 4K page of the block just before postgres updated the
+			 * entire block so it ends up looking torn to us.  We only need to
+			 * retry once because the LSN should be updated to something we can
+			 * ignore on the next pass.  If the error happens again then it is
+			 * a true validation failure.
+			 */
+			if (!block_retry)
+			{
+				/* Seek to the beginning of the failed block */
+				if (lseek(f, -BLCKSZ, SEEK_CUR) == -1)
+				{
+					fprintf(stderr, _("%s: could not lseek in file \"%s\": %m\n"),
+							progname, fn);
+					exit(1);
+				}
+
+				/* Set flag so we know a retry was attempted */
+				block_retry = true;
+
+				/* Reset loop to validate the block again */
+				blockno--;
+
+				continue;
+			}
+
+			/*
+			 * The checksum verification failed on retry as well.  Check if the
+			 * page has been modified since the checkpoint and skip it in this
+			 * case.
+			 */
+			if (PageGetLSN(buf.data) > checkpointLSN)
+			{
+				block_retry = false;
+				blocks--;
+				skippedblocks++;
+				continue;
+			}
+
 			if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
 				fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %u: calculated checksum %X but block contains %X\n"),
 						progname, fn, blockno, csum, header->pd_checksum);
 			badblocks++;
 		}
+		block_retry = false;
 	}
 
 	if (verbose)
@@ -172,6 +264,12 @@ scan_directory(const char *basedir, const char *subdir)
 		snprintf(fn, sizeof(fn), "%s/%s", path, de->d_name);
 		if (lstat(fn, &st) < 0)
 		{
+			if (online && errno == ENOENT)
+			{
+				/* File was removed in the meantime */
+				continue;
+			}
+
 			fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
 					progname, fn, strerror(errno));
 			exit(1);
@@ -308,7 +406,7 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
-	/* Check if cluster is running */
+	/* Check if checksums are enabled */
 	ControlFile = get_controlfile(DataDir, progname, &crc_ok);
 	if (!crc_ok)
 	{
@@ -316,12 +414,10 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* Check if cluster is running */
 	if (ControlFile->state != DB_SHUTDOWNED &&
 		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
-	{
-		fprintf(stderr, _("%s: cluster must be shut down to verify checksums\n"), progname);
-		exit(1);
-	}
+		online = true;
 
 	if (ControlFile->data_checksum_version == 0)
 	{
@@ -329,6 +425,9 @@ main(int argc, char *argv[])
 		exit(1);
 	}
 
+	/* Get checkpoint LSN */
+	checkpointLSN = ControlFile->checkPoint;
+
 	/* Scan all files */
 	scan_directory(DataDir, "global");
 	scan_directory(DataDir, "base");
@@ -337,7 +436,11 @@ main(int argc, char *argv[])
 	printf(_("Checksum scan completed\n"));
 	printf(_("Data checksum version: %d\n"), ControlFile->data_checksum_version);
 	printf(_("Files scanned:  %s\n"), psprintf(INT64_FORMAT, files));
+	if (skippedfiles > 0)
+		printf(_("Files skipped: %s\n"), psprintf(INT64_FORMAT, skippedfiles));
 	printf(_("Blocks scanned: %s\n"), psprintf(INT64_FORMAT, blocks));
+	if (skippedblocks > 0)
+		printf(_("Blocks skipped: %s\n"), psprintf(INT64_FORMAT, skippedblocks));
 	printf(_("Bad checksums:  %s\n"), psprintf(INT64_FORMAT, badblocks));
 
 	if (badblocks > 0)
diff --git a/src/bin/pg_verify_checksums/t/002_actions.pl b/src/bin/pg_verify_checksums/t/002_actions.pl
index 5250b5a728..8c1581037e 100644
--- a/src/bin/pg_verify_checksums/t/002_actions.pl
+++ b/src/bin/pg_verify_checksums/t/002_actions.pl
@@ -5,7 +5,7 @@ use strict;
 use warnings;
 use PostgresNode;
 use TestLib;
-use Test::More tests => 45;
+use Test::More tests => 69;
 
 
 # Utility routine to create and check a table with corrupted checksums
@@ -45,7 +45,7 @@ sub check_relation_corruption
 	# Time to create some corruption
 	open my $file, '+<', "$pgdata/$file_corrupted";
 	seek($file, $pageheader_size, 0);
-	syswrite($file, '\0\0\0\0\0\0\0\0\0');
+	syswrite($file, "\0\0\0\0\0\0\0\0\0");
 	close $file;
 
 	# Checksum checks on single relfilenode fail
@@ -104,10 +104,10 @@ append_to_file "$pgdata/global/pgsql_tmp/1.1", "foo";
 command_ok(['pg_verify_checksums',  '-D', $pgdata],
 		   "succeeds with offline cluster");
 
-# Checks cannot happen with an online cluster
+# Checksums pass on an online cluster
 $node->start;
-command_fails(['pg_verify_checksums',  '-D', $pgdata],
-			  "fails with online cluster");
+command_ok(['pg_verify_checksums',  '-D', $pgdata],
+		   "succeeds with online cluster");
 
 # Check corruption of table on default tablespace.
 check_relation_corruption($node, 'corrupt1', 'pg_default');
@@ -134,10 +134,10 @@ sub fail_corrupt
 	append_to_file $file_name, "foo";
 
 	$node->command_checks_all([ 'pg_verify_checksums', '-D', $pgdata],
-						  1,
-						  [qr/^$/],
+						  0,
+						  [qr/Files skipped:.*1/],
 						  [qr/could not read block 0 in file.*$file\":/],
-						  "fails for corrupted data in $file");
+						  "skips corrupted data in $file");
 
 	# Remove file to prevent future lookup errors on conflicts.
 	unlink $file_name;
@@ -158,3 +158,17 @@ fail_corrupt($node, "99990_vm");
 fail_corrupt($node, "99990_init.123");
 fail_corrupt($node, "99990_fsm.123");
 fail_corrupt($node, "99990_vm.123");
+
+# Start node again to test that online verification correctly skips files with short reads
+$node->start;
+fail_corrupt($node, "99990");
+fail_corrupt($node, "99990.123");
+fail_corrupt($node, "99990_fsm");
+fail_corrupt($node, "99990_init");
+fail_corrupt($node, "99990_vm");
+fail_corrupt($node, "99990_init.123");
+fail_corrupt($node, "99990_fsm.123");
+fail_corrupt($node, "99990_vm.123");
+
+# Stop node again at the end of tests
+$node->stop;