On Sun, Jan 23, 2022 at 05:40:54PM -0800, Andres Freund wrote:
> On 2022-01-23 17:17:59 -0800, Noah Misch wrote:
> > On Sun, Jan 23, 2022 at 05:03:04PM -0800, Andres Freund wrote:
> > > On January 23, 2022 3:29:27 PM PST
> > > >(a) Modify the tests so the affected animals can skip affected tests by
> > > >setting an environment variable, named PG_TEST_HAS_WAL_READ_BUG or 
> > > >similar.
> > > 
> > > Why not just detect the problem in the tap test and skip, rather than 
> > > requiring multiple buildfarm configs to be changed as well as the test 
> > > itself? 
> > 
> > End users running PostgreSQL test suites to acceptance-test their stack 
> > should
> > consider the affected stack unusable for PostgreSQL.
> 
> I'd bet that that's zero users ;)

Wouldn't surprise me.  I'm attaching what I had written and discarded.  If
nobody else hates it, I can live with it.

> > Hence, I ruled out that
> > approach, despite having implemented it at one point.  Under some plausible
> > set of goals, it is optimal.
> 
> It's not perfect due to the way we run our tests (seeing output is hard, it's
> not aggregated), but marking the test as todo rather than SKIP seems like the
> most appropriate test status. It's known to be a problem, we've not fixed it,
> but we want to be able to run the tests.
> 
> Test::more's description: "If it's something the programmer hasn't done yet,
> use TODO. This is for any code you haven't written yet, or bugs you have yet
> to fix, but want to put tests in your testing script (always a good idea)."

Could do that.  Every run that doesn't get the flaky failure will print a
message like "TODO passed:  3-5", though the test file could mitigate that by
declaring the TODO only on configurations where we expect a failure.  The
027_stream_regress.pl trouble involves reaching a die(), not failing a test in
this sense, so that one would take more work.
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    On sparc64+ext4, skip tests that expect flakes from WAL read failure.
    
    Buildfarm member kittiwake began to fail frequently when commits
    3cd9c3b921977272e6650a5efbeade4203c4bca2 and
    f47ed79cc8a0cfa154dc7f01faaf59822552363f added tests of concurrency, but
    the problem was reachable before those commits.  Back-patch to v10 (all
    supported versions).
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com

diff --git a/contrib/amcheck/t/003_cic_2pc.pl b/contrib/amcheck/t/003_cic_2pc.pl
index dfe7f0f..9ef3d49 100644
--- a/contrib/amcheck/t/003_cic_2pc.pl
+++ b/contrib/amcheck/t/003_cic_2pc.pl
@@ -9,7 +9,16 @@ use Config;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 
-use Test::More tests => 5;
+use Test::More;
+
+if (PostgreSQL::Test::Utils::has_wal_read_bug)
+{
+       plan skip_all => 'filesystem bug';
+}
+else
+{
+       plan tests => 5;
+}
 
 my ($node, $result);
 
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm 
b/src/test/perl/PostgreSQL/Test/Utils.pm
index 50be10f..cbd2cb3 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -351,6 +351,29 @@ sub perl2host
 
 =pod
 
+=item has_wal_read_bug()
+
+Returns true if $tmp_check is subject to a sparc64+ext4 bug that causes WAL
+readers to see zeros if another process simultaneously wrote the same offsets.
+Use this to skip tests that fail frequently on affected configurations.  The
+bug has made streaming standbys fail to advance, reporting corrupt WAL.  It
+has made COMMIT PREPARED fail with "could not read two-phase state from WAL".
+Non-WAL PostgreSQL reads haven't been affected, likely because those readers
+and writers have buffering systems in common.  See
+https://postgr.es/m/20220116210241.gc756...@rfd.leadboat.com for details.
+
+=cut
+
+sub has_wal_read_bug
+{
+       return
+            $Config{osname} eq 'linux'
+         && $Config{archname} =~ /^sparc/
+         && !run_log([ qw(df -x ext4), $tmp_check ], '>', '/dev/null', '2>&1');
+}
+
+=pod
+
 =item system_log(@cmd)
 
 Run (via C<system()>) the command passed as argument; the return
diff --git a/src/test/recovery/t/027_stream_regress.pl 
b/src/test/recovery/t/027_stream_regress.pl
index 8c0a8b6..2bb83d3 100644
--- a/src/test/recovery/t/027_stream_regress.pl
+++ b/src/test/recovery/t/027_stream_regress.pl
@@ -3,9 +3,18 @@ use strict;
 use warnings;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
-use Test::More tests => 4;
+use Test::More;
 use File::Basename;
 
+if (PostgreSQL::Test::Utils::has_wal_read_bug)
+{
+       plan skip_all => 'filesystem bug';
+}
+else
+{
+       plan tests => 4;
+}
+
 # Initialize primary node
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(allows_streaming => 1);

Reply via email to