On Wed, Dec 02, 2020 at 01:50:25PM +0530, Amit Kapila wrote:
> On Wed, Dec 2, 2020 at 1:20 PM Dilip Kumar <dilipbal...@gmail.com> wrote:
> > > On Mon, Nov 30, 2020 at 6:49 PM Amit Kapila <amit.kapil...@gmail.com> wrote:
> > > > On Mon, Nov 30, 2020 at 3:14 AM Noah Misch <n...@leadboat.com> wrote:
> > > > > On Mon, Sep 07, 2020 at 12:00:41PM +0530, Amit Kapila wrote:
> > > > > > Thanks, I have pushed the last patch. Let's wait for a day or so to
> > > > > > see the buildfarm reports
> > > > >
> > > > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2020-09-08%2006%3A24%3A14
> > > > > failed the new 015_stream.pl test with the subscriber looping like this:
> > > > >
> > > > > 2020-09-08 11:22:49.848 UTC [13959252:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.045 UTC [13959252:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > 2020-09-08 11:22:54.055 UTC [7602182:1] LOG:  logical replication apply worker for subscription "tap_sub" has started
> > > > > 2020-09-08 11:22:54.101 UTC [31785284:4] LOG:  background worker "logical replication worker" (PID 13959252) exited with exit code 1
> > > > > 2020-09-08 11:23:01.142 UTC [7602182:2] ERROR:  could not open temporary file "16393-510.changes.0" from BufFile "16393-510.changes": No such file or directory
> > > > > ...
> > > > The above kind of error can happen due to the following reasons:
> > > > (a) the first time we sent the stream and created the file and that
> > > > got removed before the second stream reached the subscriber. (b) from
> > > > the publisher-side, we never sent the indication that it is the first
> > > > stream and the subscriber directly tries to open the file thinking it
> > > > is already there.

Further testing showed it was a file location problem, not a deletion
problem.  The worker tried to open
base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset/16393-510.changes.0, but
these were the files actually existing:

[nm@power-aix 0:2 2020-12-08T13:56:35 64gcc 0]$ ls -la $(find src/test/subscription/tmp_check -name '*sharedfileset*')
src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.0.sharedfileset:
total 408
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr          207806 Dec 08 03:20 16393-510.changes.0

src/test/subscription/tmp_check/t_015_stream_subscriber_data/pgdata/base/pgsql_tmp/pgsql_tmp9896408.1.sharedfileset:
total 0
drwx------    2 nm       usr             256 Dec 08 03:20 .
drwx------    4 nm       usr             256 Dec 08 03:20 ..
-rw-------    1 nm       usr               0 Dec 08 03:20 16393-511.changes.0

> > I have executed "make check" in the loop with only this file.  I have
> > repeated it 5000 times but no failure, I am wondering shall we try to
> > execute in the same machine in a loop where it failed once?
>
> Yes, that might help. Noah, would it be possible for you to try that

The problem is xidhash using strcmp() to compare keys; it needs memcmp().
For this to matter, xidhash must contain more than one element.  Existing
tests rarely exercise the multi-element scenario.  Under heavy load, on
this system, the test publisher can have two active transactions at once,
in which case it does exercise multi-element xidhash.  (The publisher is
sensitive to timing, but the subscriber is not; once WAL contains
interleaved records of two XIDs, the subscriber fails every time.)  This
would be much harder to reproduce on a little-endian system, where
strcmp(&xid, &xid_plus_one)!=0.  On big-endian, every small XID has zero
in the first octet; they all look like empty strings.

The attached patch has the one-line fix and some test suite changes that
make this reproduce frequently on any big-endian system.  I'm currently
planning to drop the test suite changes from the commit, but I could keep
them if folks like them.  (They'd need more comments and timeout
handling.)
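(Illustration only, not part of the patch: a minimal standalone C program
showing the endianness effect described above.  The variable names are
hypothetical.  When two keys compare equal under string semantics, both
transactions map to a single xidhash entry, so a stream for one XID can
end up consulting the other XID's SharedFileSet, matching the directory
mismatch shown earlier.)

#include <stdio.h>
#include <string.h>
#include <stdint.h>

typedef uint32_t TransactionId;

int
main(void)
{
	/* the two XIDs from the sungazer failure */
	TransactionId xid_a = 510;
	TransactionId xid_b = 511;

	/*
	 * String semantics, dynahash's default under plain HASH_ELEM:
	 * big-endian stores 0x00 first for any XID below 2^24, so both
	 * keys read as "" and compare equal.
	 */
	printf("strcmp: %d\n",
		   strcmp((const char *) &xid_a, (const char *) &xid_b));

	/*
	 * Binary semantics, what HASH_BLOBS arranges: all four bytes
	 * participate, so the keys differ on any architecture.
	 */
	printf("memcmp: %d\n",
		   memcmp(&xid_a, &xid_b, sizeof(xid_a)));

	return 0;
}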
Author:     Noah Misch <n...@leadboat.com>
Commit:     Noah Misch <n...@leadboat.com>

    Use HASH_BLOBS for xidhash.
    
    This caused BufFile errors on buildfarm member sungazer, and SIGSEGV
    was possible.  Conditions for reaching those symptoms were more
    frequent on big-endian systems.
    
    Reviewed by FIXME.
    
    Discussion: https://postgr.es/m/20201129214441.ga691...@rfd.leadboat.com

diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index c37aafe..fce1dee 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -804,7 +804,7 @@ apply_handle_stream_start(StringInfo s)
 		hash_ctl.entrysize = sizeof(StreamXidHash);
 		hash_ctl.hcxt = ApplyContext;
 		xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
-							  HASH_ELEM | HASH_CONTEXT);
+							  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 	}
 
 	/* open the spool file for this transaction */
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index 1488bff..40610f1 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1626,6 +1626,42 @@ sub interactive_psql
 	return $harness;
 }
 
+# return IPC::Run harness object for non-interactive psql
+# FIXME pick a better name, and add POD docs
+sub psql_printable
+{
+	my ($self, $dbname, $stdin, $stdout, $timer, %params) = @_;
+
+	my $replication = $params{replication};
+
+	my @psql_params = (
+		'psql',
+		'-XAtq',
+		'-d',
+		$self->connstr($dbname)
+		  . (defined $replication ? " replication=$replication" : ""),
+		'-f',
+		'-');
+
+	$params{on_error_stop} = 1 unless defined $params{on_error_stop};
+
+	push @psql_params, '-v', 'ON_ERROR_STOP=1' if $params{on_error_stop};
+	push @psql_params, @{ $params{extra_params} }
+	  if defined $params{extra_params};
+
+	# Ensure there is no data waiting to be sent:
+	$$stdin = "" if ref($stdin);
+	# IPC::Run would otherwise append to existing contents:
+	$$stdout = "" if ref($stdout);
+
+	my $harness = IPC::Run::start \@psql_params,
+	  '<', $stdin, '>', $stdout, $timer;
+
+	die "psql startup timed out" if $timer->is_expired;
+
+	return $harness;
+}
+
 =pod
 
 =item $node->poll_query_until($dbname, $query [, $expected ])
diff --git a/src/test/subscription/t/015_stream.pl b/src/test/subscription/t/015_stream.pl
index fffe001..9ebe166 100644
--- a/src/test/subscription/t/015_stream.pl
+++ b/src/test/subscription/t/015_stream.pl
@@ -47,14 +47,35 @@ my $result =
 is($result, qq(2|2|2), 'check initial data was copied to subscriber');
 
 # Insert, update and delete enough rows to exceed the 64kB limit.
-$node_publisher->safe_psql('postgres', q{
+my $in  = '';
+my $out = '';
+
+my $timer = IPC::Run::timer(180);
+
+my $h = $node_publisher->psql_printable('postgres', \$in, \$out, $timer);
+
+$in .= q{
 BEGIN;
 INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(3, 5000) s(i);
 UPDATE test_tab SET b = md5(b) WHERE mod(a,2) = 0;
 DELETE FROM test_tab WHERE mod(a,3) = 0;
+};
+$h->pump;
+
+$node_publisher->safe_psql('postgres', q{
+BEGIN;
+INSERT INTO test_tab SELECT i, md5(i::text) FROM generate_series(5001, 9999) s(i);
+DELETE FROM test_tab WHERE a > 5000;
 COMMIT;
 });
 
+$in .= q{
+COMMIT;
+\q
+};
+$h->pump;
+$h->finish;
+
 $node_publisher->wait_for_catchup($appname);
 
 $result =
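For readers less familiar with dynahash's flag semantics, here is a
hedged sketch of what the one-line change accomplishes (assuming the
PG13-era API; the keysize assignment and the "ent"/"xid" variables are
assumptions for illustration, since the hunk above doesn't show them):

/*
 * With HASH_ELEM alone, dynahash defaults to string hashing and
 * string comparison, valid only for NUL-terminated string keys.
 * HASH_BLOBS selects binary hashing plus memcmp() over all keysize
 * bytes, which is what a fixed-width TransactionId key requires.
 */
HASHCTL		hash_ctl;

memset(&hash_ctl, 0, sizeof(hash_ctl));
hash_ctl.keysize = sizeof(TransactionId);	/* assumed; not in the hunk */
hash_ctl.entrysize = sizeof(StreamXidHash);
hash_ctl.hcxt = ApplyContext;
xidhash = hash_create("StreamXidHash", 1024, &hash_ctl,
					  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

/* lookups now hash and compare every key byte, so 510 != 511 */
StreamXidHash *ent = (StreamXidHash *)
	hash_search(xidhash, &xid, HASH_FIND, NULL);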