> On 31 July 2017 at 07:49, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
>> tests. I didn't manage to break anything, except one mysterious error that I've
>> got only once on one of my replicas, but I couldn't reproduce it yet.
>> Interesting thing is that this error did not affect another replica or primary.
>> Just in case here is the log for this error (maybe you can see something
>> obvious, that I've not noticed):
>>
>> LOG: could not remove directory "pg_tblspc/47733/PG_10_201707211/47732":
>> Directory not empty
>> ...
>
> Hmm. The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running? In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.

Hmm... no, I don't think so. Basically, all I was doing was running
`installcheck` against a primary instance (I assume there is nothing wrong
with this approach, am I right?). This particular error was caused by the
`tablespace` test, which failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR: could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
```

I tried a few more times, and I got it twice in four attempts on a fresh
installation (with all instances on the same machine). But anyway, I'll try
to investigate; maybe it has something to do with my environment.

> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> > time shift when we're dealing with leases). Does it make sense to move
> > them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio. You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.

OK, but what do you think about converting them to constants to make them
more self-explanatory? Like:

```
/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew. We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```

I also have another question. I tested this patch a little bit more, and I
got some strange behaviour after running pgbench (here is the full output [1]):

```
# primary
$ ./bin/pgbench -s 100 -i test
NOTICE: table "pgbench_history" does not exist, skipping
NOTICE: table "pgbench_tellers" does not exist, skipping
NOTICE: table "pgbench_accounts" does not exist, skipping
NOTICE: table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG: revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG: standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG: revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG: standby "walreceiver" is no longer available for synchronous replay
```

```
# replica
2017-09-30 15:47:51.802 CEST [6034] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG: invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG: started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL: could not receive data from WAL stream: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
```

Is this a known issue, or is it unrelated to the patch itself?

[1]: https://gist.github.com/erthalion/cdc9357f7437171192348239eb4db764
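
To make the constants suggestion a bit more concrete, here is a rough standalone sketch of the 25% clock-skew allowance expressed through a named constant. This is not code from the patch: the helper names (`max_clock_skew_ms`, `lease_expiry_on_wire_ms`) are hypothetical, and I'm assuming the lease time is in milliseconds and that the skew is subtracted from the lease expiry time sent to the standby. Only `MAX_CLOCK_SKEW_PORTION` mirrors the name proposed in the snippet.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Fraction of the lease time reserved for clock skew between primary and
 * standby.  The 25% value comes from the patch's comment; the constant
 * name is only a proposal.
 */
#define MAX_CLOCK_SKEW_PORTION 0.25

/*
 * Hypothetical helper: maximum tolerated clock skew for a given lease
 * time.  Both values are assumed to be in milliseconds.
 */
static int64_t
max_clock_skew_ms(int64_t lease_time_ms)
{
    assert(lease_time_ms >= 0);
    return (int64_t) (lease_time_ms * MAX_CLOCK_SKEW_PORTION);
}

/*
 * Hypothetical helper: the expiry time to send to the standby, i.e. the
 * local expiry time minus the skew allowance (an assumption about how
 * the subtraction described in the comment is applied).
 */
static int64_t
lease_expiry_on_wire_ms(int64_t now_ms, int64_t lease_time_ms)
{
    return now_ms + lease_time_ms - max_clock_skew_ms(lease_time_ms);
}
```

With a 5000 ms lease this reserves 1250 ms for skew, so the ratio stays fixed and there is still only the single knob. The 50% figure could get a named constant in the same way.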