Re: [HACKERS] Causal reads take II

Dmitry Dolgov Sat, 30 Sep 2017 13:07:28 -0700

> On 31 July 2017 at 07:49, Thomas Munro <thomas.mu...@enterprisedb.com> wrote:
>> On Sun, Jul 30, 2017 at 7:07 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote:
>>
>> I looked through the code of `synchronous-replay-v1.patch` a bit and ran a few
>> tests. I didn't manage to break anything, except one mysterious error that I've
>> got only once on one of my replicas, but I couldn't reproduce it yet.
>> Interesting thing is that this error did not affect another replica or primary.
>> Just in case here is the log for this error (maybe you can see something
>> obvious, that I've not noticed):
>>
>> LOG:  could not remove directory "pg_tblspc/47733/PG_10_201707211/47732": Directory not empty
>> ...
>
> Hmm.  The first error ("could not remove directory") could perhaps be
> explained by temporary files from concurrent backends.
> ...
> Perhaps in your testing you accidentally copied a pgdata directory over the
> top of it while it was running?  In any case I'm struggling to see how
> anything in this patch would affect anything at the REDO level.


Hmm... no, I don't think so. Basically, all I was doing was running
`installcheck` against a primary instance (I assume there is nothing wrong
with this approach, am I right?). This particular error came from the
`tablespace` test, which failed in this case:

```
INSERT INTO testschema.foo VALUES(1);
ERROR:  could not open file "pg_tblspc/16388/PG_11_201709191/16386/16390": No such file or directory
```

I tried a few more times and got it in two out of four attempts on a fresh
installation (with all instances on the same machine). Anyway, I'll try to
investigate; maybe it has something to do with my environment.

> > * Also I noticed that some time-related values are hardcoded (e.g. 50%/25%
> >   time shift when we're dealing with leases). Does it make sense to move
> >   them out and make them configurable?
>
> These numbers are interrelated, and I think they're best fixed in that
> ratio.  You could make it more adjustable, but I think it's better to
> keep it simple with just a single knob.

OK, but what do you think about converting them to constants to make them
more self-explanatory? Like:

```
/*
+ * Since this timestamp is being sent to the standby where it will be
+ * compared against a time generated by the standby's system clock, we
+ * must consider clock skew.  We use 25% of the lease time as max
+ * clock skew, and we subtract that from the time we send with the
+ * following reasoning:
+ */
+int max_clock_skew = synchronous_replay_lease_time * MAX_CLOCK_SKEW_PORTION;
```
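
To make the idea concrete, here is a minimal self-contained sketch. It is not
code from the patch: the constant `MAX_CLOCK_SKEW_PORTION`, both function
names, the millisecond units, and the assumption that a lease expires at
"now + lease time" are all hypothetical and used only for illustration. The
only thing taken from the comment above is that 25% of the lease time is
treated as the maximum clock skew and subtracted from the time that is sent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: fraction of the lease time allowed as clock skew. */
#define MAX_CLOCK_SKEW_PORTION 0.25

/* Maximum tolerated clock skew in milliseconds for a given lease time. */
static int
max_clock_skew_ms(int lease_time_ms)
{
    assert(lease_time_ms >= 0);
    return (int) (lease_time_ms * MAX_CLOCK_SKEW_PORTION);
}

/*
 * Expiry time to send to the standby (assumed model): the local expiry,
 * now + lease time, minus the skew allowance, so that a standby whose clock
 * runs behind by up to the allowance still sees the lease expire in time.
 */
static int64_t
lease_expiry_to_send_ms(int64_t now_ms, int lease_time_ms)
{
    return now_ms + lease_time_ms - max_clock_skew_ms(lease_time_ms);
}
```

With a named constant the 25% ratio appears in exactly one place, while the
lease time remains the single user-visible knob.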

I also have another question. I tried to test this patch a little bit more,
and I got some strange behaviour after running pgbench (here is the full
output [1]):

```
# primary

$ ./bin/pgbench -s 100 -i test

NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
creating tables...
100000 of 10000000 tuples (1%) done (elapsed 0.11 s, remaining 10.50 s)
200000 of 10000000 tuples (2%) done (elapsed 1.06 s, remaining 52.00 s)
300000 of 10000000 tuples (3%) done (elapsed 1.88 s, remaining 60.87 s)
2017-09-30 15:47:26.884 CEST [6035] LOG:  revoking synchronous replay lease for standby "walreceiver"...
2017-09-30 15:47:26.900 CEST [6035] LOG:  standby "walreceiver" is no longer available for synchronous replay
2017-09-30 15:47:26.903 CEST [6197] LOG:  revoking synchronous replay lease for standby "walreceiver"...
400000 of 10000000 tuples (4%) done (elapsed 2.44 s, remaining 58.62 s)
2017-09-30 15:47:27.979 CEST [6197] LOG:  standby "walreceiver" is no longer available for synchronous replay
```

```
# replica

2017-09-30 15:47:51.802 CEST [6034] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2017-09-30 15:47:55.154 CEST [6030] LOG:  invalid magic number 0000 in log segment 000000010000000000000020, offset 10092544
2017-09-30 15:47:55.257 CEST [10508] LOG:  started streaming WAL from primary at 0/20000000 on timeline 1
2017-09-30 15:48:09.622 CEST [10508] FATAL:  could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
```

Is this a known issue, or is it unrelated to the patch itself?

[1]: https://gist.github.com/erthalion/cdc9357f7437171192348239eb4db764
