[HACKERS] logical replication - still unstable after all these months

Erik Rijkers Thu, 25 May 2017 23:11:40 -0700

If you run a 1-minute pgbench session over a logical replication connection and repeat that 100x, this is what you get:


At clients 90, 64, 8, scale 25:


-- out_20170525_0944.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 25
     93 -- All is well.
      7 -- Not good.
-- out_20170525_1426.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 25
     82 -- All is well.
     18 -- Not good.
-- out_20170525_2049.txt
    100 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 25
     90 -- All is well.
     10 -- Not good.


At clients 90, 64, 8, scale 5:

-- out_20170526_0126.txt
    100 -- pgbench -c 90 -j 8 -T 60 -P 12 -n   --  scale 5
     98 -- All is well.
      2 -- Not good.
-- out_20170526_0352.txt
    100 -- pgbench -c 64 -j 8 -T 60 -P 12 -n   --  scale 5
     97 -- All is well.
      3 -- Not good.
-- out_20170526_0621.txt
     45 -- pgbench -c 8 -j 8 -T 60 -P 12 -n   --  scale 5
     41 -- All is well.
      3 -- Not good.

        (That last one obviously not finished)
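To put the tallies above on one scale, the failure percentage of each run set can be worked out from its two counts. The following is only a minimal sketch: fail_rate is a hypothetical helper name, not part of the test program, and the numbers passed to it are copied from the output above.

```shell
# Hypothetical helper: failure percentage from the two tallies of one run set.
# Usage: fail_rate <all_is_well_count> <not_good_count>
fail_rate() {
  awk -v ok="$1" -v bad="$2" 'BEGIN {
    total = ok + bad
    if (total == 0) { print "n/a"; exit }   # no finished runs yet
    printf "%.1f%%\n", 100 * bad / total
  }'
}

fail_rate 93 7     # scale 25, 90 clients
fail_rate 82 18    # scale 25, 64 clients
fail_rate 90 10    # scale 25,  8 clients
fail_rate 98 2     # scale 5,  90 clients
fail_rate 97 3     # scale 5,  64 clients
fail_rate 41 3     # scale 5,   8 clients (unfinished run set)
```

The last call uses only the 44 runs that had finished, so its percentage is provisional.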


I think this is pretty awful, really, for a beta level.

The above installations (master+replica) are with Petr Jelinek's (and Michael Paquier's) last patches:

 0001-Fix-signal-handling-in-logical-workers.patch
 0002-Make-tablesync-worker-exit-when-apply-dies-while-it-.patch
 0003-Receive-invalidation-messages-correctly-in-tablesync.patch
 Remove-the-SKIP-REFRESH-syntax-suggar-in-ALTER-SUBSC-v2.patch

Now, it could be that there is somehow something wrong with my test setup (as opposed to some bug in log-repl). I can post my test program, but I'll do that separately (below is the core of all my tests; it's basically still that very first test that I started out with, many months ago...).



I'd like to find out/know more about:
- Do you agree this number of failures is far too high?
- Am I the only one finding so many failures?

- Is anyone else testing the same way (more or less continually, finding only success)?
- Which of the Open Items could be responsible for this failure rate? (I don't see a match.)
- What tests do others do? Could we somehow concentrate results and method somewhere?



Thanks,


Erik Rijkers




PS

The core of the 'pgbench_derail' test (bash) is simply:

echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port1 \
&& echo "drop table if exists pgbench_accounts;
drop table if exists pgbench_branches;
drop table if exists pgbench_tellers;
drop table if exists pgbench_history;" | psql -qXp $port2 \
&& pgbench -p $port1 -qis $scale \
&& echo "alter table pgbench_history add column hid serial primary key;" \
 | psql -q1Xp $port1 && pg_dump -F c -p $port1 \
    --exclude-table-data=pgbench_history  \
    --exclude-table-data=pgbench_accounts \
    --exclude-table-data=pgbench_branches \
    --exclude-table-data=pgbench_tellers  \
  -t pgbench_history -t pgbench_accounts \
  -t pgbench_branches -t pgbench_tellers \
 | pg_restore -1 -p $port2 -d testdb
appname=derail2
echo "create publication pub1 for all tables;" | psql -p $port1 -aqtAX
echo "create subscription sub1 connection 'port=${port1}
  application_name=$appname' publication pub1 with(enabled=false);
alter subscription sub1 enable;" | psql -p $port2 -aqtAX

pgbench -c $clients -j $threads -T $duration -P $pseconds -n # scale$scale

Now compare the md5s of the sorted content of each of the 4 pgbench tables on primary and replica. They should be the same.
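The comparison step could be sketched roughly as follows. This is not the actual test program: table_md5 and compare_tables are hypothetical helper names, md5sum is assumed to be available (GNU coreutils), and psql is assumed to reach the right database through its defaults, as in the commands above. It also assumes the replica has been given time to catch up before the comparison runs.

```shell
# Sketch only: table_md5 and compare_tables are hypothetical helpers.

# Print the md5 of a table's rows, sorted so that physical row order
# does not affect the result.
# Usage: table_md5 <port> <table>
table_md5() {
  psql -qtAX -p "$1" -c "select * from $2" | LC_ALL=C sort | md5sum | cut -d' ' -f1
}

# Compare the four pgbench tables on two instances.
# Returns non-zero if any table differs.
# Usage: compare_tables <primary_port> <replica_port>
compare_tables() {
  rc=0
  for t in pgbench_accounts pgbench_branches pgbench_tellers pgbench_history; do
    a=$(table_md5 "$1" "$t")
    b=$(table_md5 "$2" "$t")
    if [ "$a" = "$b" ]; then
      echo "$t: ok"
    else
      echo "$t: DIFFERS"
      rc=1
    fi
  done
  return $rc
}
```

It would be called as: compare_tables $port1 $port2 && echo "All is well." || echo "Not good."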




