On 2017-05-27 01:35, Mark Kirkwood wrote:
On 26/05/17 20:09, Erik Rijkers wrote:

this whole thing 100x

Some questions that might help me get it right:
- do you think we need to stop and start the instances every time?
- do we need to init pgbench each time?
- could we just drop the subscription and publication and truncate the replica tables instead?

I have done all that in earlier versions.

I deliberately added these 'complications' in view of the intractability of the problem: my fear is that an earlier failure leaves some half-failed state behind in an instance, which might then cause further failures. This would undermine the intent of the whole exercise (which is to count the success/failure rate). So it is important to be as sure as possible that each cycle starts out as cleanly as possible.

- what scale pgbench are you running?

I use a small script to call the main script; at the moment it does something like:
-------------------
duration=60
from=1
to=100
for scale in 25 5
do
  for clients in 90 64 8
  do
    date_str=$(date +"%Y%m%d_%H%M")
    outfile=out_${date_str}.txt
    time for x in `seq $from $to`
    do
        ./pgbench_derail2.sh $scale $clients $duration $date_str
[...]
-------------------

- how many clients for the 1 min pgbench run?

see above

- are you starting the pgbench run while the copy_data jobs for the subscription are still running?

I assume that by copy_data you mean the data sync of the original table before pgbench starts.
And yes, I think this might be the origin of the problem.
(I think the problem I get could actually be avoided quite easily by putting wait states here and there between the separate steps. But the testing idea here is to force the system into error, not to avoid errors.)
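
A wait state of that kind could look roughly like the sketch below. It is only an illustration, not part of pgbench_derail2.sh: the name wait_for_sync and its timeout argument are made up here, and it assumes the PostgreSQL 10 catalog pg_subscription_rel, where srsubstate = 'r' marks a table whose initial sync is finished.

```shell
# Hypothetical helper (not in the posted script): poll the replica until every
# table of every subscription has reached sync state 'r' (ready), or give up
# after a timeout. Returns 0 when all tables are ready, 1 on timeout.
wait_for_sync()
{
  local port="$1"
  local timeout="${2:-60}"   # seconds; arbitrary default chosen for this sketch
  local waited=0
  local pending
  while [ "$waited" -lt "$timeout" ]
  do
    # count the subscribed tables that are not yet in the 'ready' state
    pending=$(echo "select count(*) from pg_subscription_rel where srsubstate <> 'r'" | psql -qtAXp "$port")
    if [ "$pending" = "0" ]
    then
      return 0
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 1
}
```

Calling something like `wait_for_sync $port2 120` before starting pgbench would avoid the overlap, which is exactly what the test deliberately does not do.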

- how exactly are you calculating those md5's?

Here is the bash function: cb (I forget what that stands for, I guess 'content bench'). $outf is a log file to which the program writes output:

---------------------------
function cb()
{
  #  display the 4 pgbench tables' accumulated content as md5s
  #  a,b,t,h stand for:  pgbench_accounts, -branches, -tellers, -history
  num_tables=$( echo "select count(*) from pg_class where relkind = 'r' and relname ~ '^pgbench_'" | psql -qtAX )
  if [[ $num_tables -ne 4 ]]
  then
     echo "pgbench tables not 4 - exit" >> $outf
     exit
  fi
  for port in $port1 $port2
  do
    md5_a=$(echo "select * from pgbench_accounts order by aid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_b=$(echo "select * from pgbench_branches order by bid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_t=$(echo "select * from pgbench_tellers order by tid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    md5_h=$(echo "select * from pgbench_history order by hid"|psql -qtAXp $port|md5sum|cut -b 1-9)
    cnt_a=$(echo "select count(*) from pgbench_accounts" |psql -qtAXp $port)
    cnt_b=$(echo "select count(*) from pgbench_branches" |psql -qtAXp $port)
    cnt_t=$(echo "select count(*) from pgbench_tellers" |psql -qtAXp $port)
    cnt_h=$(echo "select count(*) from pgbench_history" |psql -qtAXp $port)
    md5_total[$port]=$( echo "${md5_a} ${md5_b} ${md5_t} ${md5_h}" | md5sum )
    printf "$port a,b,t,h: %8d %6d %6d %6d" $cnt_a $cnt_b $cnt_t $cnt_h
    echo -n "  $md5_a $md5_b $md5_t $md5_h"
    if   [[ $port -eq $port1 ]]; then echo    " master"
    elif [[ $port -eq $port2 ]]; then echo -n " replica"
    else                              echo    "           ERROR  "
    fi
  done
  if [[ "${md5_total[$port1]}" == "${md5_total[$port2]}" ]]
  then
    echo " ok"
  else
    echo " NOK"
  fi
}
---------------------------
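
The four near-identical md5 pipelines in cb could be folded into one small helper. A rough sketch follows; table_md5 is a hypothetical name and this is not how the posted script is written:

```shell
# Hypothetical helper (not in the posted script): print the first 9 characters
# of the md5 of one table's content on the instance listening on the given
# port. Rows are ordered by the given column so that both instances hash them
# in the same order.
table_md5()
{
  local port="$1" table="$2" order_col="$3"
  echo "select * from $table order by $order_col" \
    | psql -qtAXp "$port" | md5sum | cut -b 1-9
}
```

Inside the loop over ports this would be used as, for example, `md5_a=$(table_md5 $port pgbench_accounts aid)`.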

this enables:

echo "-- getting md5 (cb)"
cb_text1=$(cb)

and testing that string like:

    if echo "$cb_text1" | grep -qw 'replica ok';
    then
       echo "-- All is well."

[...]


Later today I'll try to clean up the whole thing and post it.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)