Re: [HACKERS] logical replication - still unstable after all these months

Erik Rijkers Fri, 26 May 2017 19:43:55 -0700

On 2017-05-27 01:35, Mark Kirkwood wrote:

On 26/05/17 20:09, Erik Rijkers wrote:


this whole thing 100x


Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?

- could we just drop the subscription and publication and truncate the replica tables instead?


I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability of the problem: my fear is that an earlier failure leaves some half-failed state behind in an instance, which then might cause more failures. This would undermine the intent of the whole exercise (which is to count the success/failure rate). So it is important to be as sure as possible that each cycle starts out as cleanly as possible.

- what scale pgbench are you running?

I use a small script to call the main script; at the moment it does something like:

-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
        ./pgbench_derail2.sh $scale $clients $duration $date_str
[...]
-------------------

- how many clients for the 1 min pgbench run?


see above

- are you starting the pgbench run while the copy_data jobs for the subscription are still running?

I assume that by copy_data you mean the data sync of the original table before pgbench starts.

And yes, I think here might be the origin of the problem.

(I think the problem I get is actually easily avoided by putting wait states here and there in between separate steps. But the testing idea here is to force the system into error, not to avoid any errors.)
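
A minimal sketch of what one such wait state could look like, for anyone who wants to avoid the race rather than provoke it. It is not taken from the test script: wait_until_zero and pending are hypothetical names, and the query in the comment assumes the PostgreSQL 10 catalog pg_subscription_rel, where srsubstate = 'r' marks a table whose initial sync is ready.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original script): poll a command until
# it prints 0, or give up after a number of tries.
#   $1 = max tries, $2 = seconds between tries, rest = command to run
wait_until_zero()
{
  local tries=$1 pause=$2 i
  shift 2
  for (( i = 0; i < tries; i++ ))
  do
    if [[ "$("$@")" == "0" ]]
    then
      return 0
    fi
    sleep "$pause"
  done
  return 1
}

# Assumed usage on the replica: count tables whose initial sync is not yet
# 'r' (ready) and start pgbench only once that count reaches 0.
#   pending() { psql -qtAXp "$port2" -c "select count(*) from pg_subscription_rel where srsubstate <> 'r'"; }
#   wait_until_zero 60 1 pending || echo "initial sync did not finish"
```

Leaving this step out, as the test does, keeps the window open in which pgbench writes while the initial copy is still in progress.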

- how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 'content bench'). $outf is a log file to which the program writes output:


---------------------------
function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history

  num_tables=$( echo "select count(*) from pg_class where relkind = 'r' and relname ~ '^pgbench_'" | psql -qtAX )

  if [[ $num_tables -ne 4 ]]
  then
     echo "pgbench tables not 4 - exit" >> $outf
     exit
  fi
  for port in $port1 $port2
  do

    md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers order by tid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history order by hid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" |psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" |psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers" |psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history" |psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" |md5sum )

    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n "  $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    "           ERROR  "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---------------------------
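
The four md5 pipelines in cb differ only in table name and sort column, so they could be factored into a helper. A rough sketch under that assumption; short_md5 and table_md5 are hypothetical names, not part of the script above:

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are not from the original script).

# Read rows on stdin and print the first 9 characters of their md5 digest,
# the same truncation cb uses.
short_md5()
{
  md5sum | cut -b 1-9
}

# Print the short digest of one table on one instance.
#   $1 = port, $2 = table name, $3 = column to order by
table_md5()
{
  echo "select * from $2 order by $3" | psql -qtAXp "$1" | short_md5
}

# Assumed usage inside the per-port loop of cb:
#   md5_a=$(table_md5 $port pgbench_accounts aid)
```

The order by matters: the digest is taken over the rows as text, so master and replica only produce equal digests for equal content if both return the rows in the same order.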

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:

    if echo "$cb_text1" | grep -qw 'replica ok';
    then
       echo "-- All is well."

[...]
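
Since the point of the exercise is to count the success/failure rate, the check above could feed a pair of counters. A small sketch, not the elided part of the script: tally, ok_count and nok_count are hypothetical names, and the two sample strings are made up to resemble the tail of a cb report.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: classify one cb report and keep running totals.
ok_count=0
nok_count=0

tally()
{
  # $1 = text captured from cb, e.g. cb_text1=$(cb)
  if echo "$1" | grep -qw 'replica ok'
  then
    ok_count=$(( ok_count + 1 ))
  else
    nok_count=$(( nok_count + 1 ))
  fi
}

# Made-up sample lines, shaped like the tail of a cb report:
tally "... replica ok"
tally "... replica NOK"
echo "ok: $ok_count  NOK: $nok_count"
```

Anything that does not end in 'replica ok' is counted as a failure here, which includes a cb run that exited early because the pgbench tables were missing.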


Later today I'll try to clean up the whole thing and post it.














--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
