On Wed, Jul 10, 2024 at 11:22 PM Nitin Motiani <nitinmoti...@google.com> wrote:
>
> On Wed, Jul 10, 2024 at 10:39 PM vignesh C <vignes...@gmail.com> wrote:
> >
> > On Wed, 10 Jul 2024 at 12:28, Amit Kapila <amit.kapil...@gmail.com> wrote:
> > > The patch missed using ShareRowExclusiveLock for partitions; see the
> > > attached. I haven't tested it, but they should also face the same
> > > problem. Apart from that, I have changed the comments in a few places
> > > in the patch.
> >
> > I could not hit the updated ShareRowExclusiveLock changes through a
> > partitioned table; instead, I could verify it using an inheritance
> > table. I have added a test for this and am also attaching the
> > back-branch patch.
> >
>
> Hi,
>
> I tested alternative-experimental-fix-lock.patch provided by Tomas
> (which replaces ShareUpdateExclusiveLock with ShareRowExclusiveLock in
> OpenTableList). I believe there are a couple of scenarios the patch
> does not cover.
>
> 1. It doesn't handle the case of "ALTER PUBLICATION <pub> ADD TABLES
> IN SCHEMA <schema>".
>
> I took crash-test.sh provided by Tomas and modified it to add all
> tables in the schema to the publication using the following command:
>
>            ALTER PUBLICATION p ADD TABLES IN SCHEMA public
>
> The modified script is attached (crash-test-with-schema.sh). With this
> script, I can reproduce the issue even with the patch applied. This is
> because the code path to add a schema to the publication doesn't go
> through OpenTableList.
>
> I have also attached a script run-test-with-schema.sh to run
> crash-test-with-schema.sh in a loop with randomly generated parameters
> (modified from run.sh provided by Tomas).
>
> 2. The second issue is a deadlock which happens when the ALTER
> PUBLICATION command is run with a comma-separated list of tables.
>
> I created another script, create-test-tables-order-reverse.sh. This
> script runs a command like the following:
>
>             ALTER PUBLICATION p ADD TABLE test_2,test_1
>
> Running the above script, I was able to get a deadlock error (the
> output is attached in deadlock.txt). In the alter publication command,
> I added the tables in the reverse order to increase the probability of
> the deadlock, since the command presumably takes its per-table locks
> in the order the tables are listed, while the concurrent pgbench
> transactions lock the tables in a different (randomized) order. But it
> should happen with any order of tables.
>
> I am not sure if the deadlock is a major issue, because detecting a
> deadlock is better than data loss. The schema issue is probably more
> important. I didn't test with the latest patches sent by Vignesh, but
> since the code changes in those patches are also in OpenTableList, I
> think the schema scenario won't be covered by them.
>


Hi,

I looked further into the scenario of adding the tables in a schema to
the publication. Since in that case an entry is added to
pg_publication_namespace instead of pg_publication_rel, the code paths
for 'add table' and 'add tables in schema' are different. And in the
'add tables in schema' scenario, the OpenTableList function is not
called to get the relation ids. Therefore, even with the proposed
patch, the data loss issue still persists in that case.
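
For context, my understanding of the experimental fix is that it only
raises the lock level taken while opening the tables listed explicitly
in the command, roughly like this (a sketch based on my reading of the
patch, not the verbatim hunk):

    /* OpenTableList() in src/backend/commands/publicationcmds.c */
    foreach(lc, tables)
    {
        PublicationTable *t = lfirst_node(PublicationTable, lc);

        /*
         * Sketch: previously ShareUpdateExclusiveLock. The stronger
         * ShareRowExclusiveLock conflicts with ROW EXCLUSIVE, so
         * concurrent writers block until the ALTER PUBLICATION commits.
         */
        rel = table_openrv(t->relation, ShareRowExclusiveLock);
        /* ... */
    }

Since the 'add tables in schema' path never reaches OpenTableList, the
stronger lock is never taken for that form of the command.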

To validate this idea, I tried locking all the affected tables in the
schema (in ShareRowExclusiveLock mode) just before the invalidation of
those relations. I am attaching a small patch for that
(alter_pub_for_schema.patch), where the change is made in the function
publication_add_schema in pg_publication.c. I am not sure if this is
the best place to make the change or if it is the right fix. It is
conceptually similar to the proposed change in OpenTableList, but here
we are not just raising the lock mode; we are taking locks which were
not taken before. With this change, though, the data loss errors went
away in my test script.
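
Roughly, the idea looks like this (a sketch only; the attached patch is
the authoritative version, and the surrounding variable names are from
my reading of pg_publication.c):

    /* publication_add_schema() in src/backend/catalog/pg_publication.c */
    List       *schemaRels;
    ListCell   *lc;

    schemaRels = GetSchemaPublicationRelations(schemaid,
                                               PUBLICATION_PART_ALL);

    /*
     * Sketch: take the same lock that OpenTableList now takes for ADD
     * TABLE, so that in-progress writes to these tables commit either
     * before or after the publication change, never across it.
     */
    foreach(lc, schemaRels)
        LockRelationOid(lfirst_oid(lc), ShareRowExclusiveLock);

    /* existing code: invalidate relcache so publication info is rebuilt */
    InvalidatePublicationRels(schemaRels);

With this in place, the 'add tables in schema' path takes the same
locks as the 'add table' path.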

Another issue which persists with this change is the deadlock. Since
multiple table locks are acquired, the test script hits a deadlock
error a few times. Therefore, I'm also attaching another modified
script which does a few retries in case of deadlock. The script is
crash-test-with-retries-for-schema.sh. It runs the following command
in a retry loop:

              ALTER PUBLICATION p ADD TABLES IN SCHEMA public

If the command fails, it sleeps for a random amount of time (bounded
above by a MAXWAIT parameter) and then retries the command. If it
fails to run the command within the maximum number of retries, the
final return value from the script is DEADLOCK, since we can't do a
consistency check in this scenario. Also attached is another script,
run-with-deadlock-detection.sh, which can run the above script for
multiple iterations.

I tried the test scripts with and without alter_pub_for_schema.patch.
Without the patch, I get the final output ERROR the majority of the
time, which means that the publication was altered successfully but
the data was lost on the subscriber. When I run it with the patch, I
get a mix of OK (no data loss) and DEADLOCK (the publication was not
altered), but no ERROR. I think that by changing the sleep time and
the number of retries we can get different fractions of OK and
DEADLOCK.

I am not sure if this is the right or the cleanest way to fix the
issue, but I think it is conceptually the right direction. Please let
me know if my understanding is wrong or if I'm missing something.

Thanks & Regards,
Nitin Motiani
Google

Attachment: alter_pub_for_schema.patch
Description: Binary data

#!/usr/bin/bash
#
# Runs crash-test-with-retries-for-schema.sh with randomized parameters in a loop, and ensures
# each run completes within 15 minutes (to prevent infinite lockups).
#
# The randomized parameters are
#
# number of tables - 10 to 109
# refresh interval - 10 to 109 tables
# sleep length     - 1 to 10 seconds
#
# Each run stores data in a separate directory, derived from timestamp.
#

RUNS=$1
TIMEOUT=900

#for v in 12 14 16 master; do
for v in master; do

	PUB_VERSION=$v
	SUB_VERSION=$v

	for r in `seq 1 $RUNS`; do

		t=$((10 + RANDOM % 100))	# number of tables
		m=$((10 + RANDOM % 100))	# publication refresh interval
		s=$((1 + RANDOM % 10))		# sleep between steps (seconds)

		OUTDIR=`date +%Y%m%d-%H%M%S`

		if [ -d "$OUTDIR" ]; then
			echo "directory $OUTDIR already exists"
			exit
		fi

		mkdir $OUTDIR

		# make sure the test is terminated after $TIMEOUT seconds (to prevent infinite lockups)
		# (pgrep -f matches the full command line; the script name is too long for plain pgrep)
		sleep $TIMEOUT && kill `pgrep -f crash-test-with-retries-for-schema.sh` &
		pid=$!

		# run the test with random parameters (NUM_RETRIES=5, MAX_WAIT=30)
		./crash-test-with-retries-for-schema.sh $OUTDIR $t $m  $s 5 30 > $OUTDIR/run.log 2>&1
		r=$?

		kill $pid

		# make sure it's dead
		killall -9 -q postgres

		if [ "$r" == "0" ]; then
			echo $OUTDIR $t $m $s : OK
		elif [ "$r" == "1" ]; then
			echo $OUTDIR $t $m $s : ERROR
		else
			echo $OUTDIR $t $m $s : DEADLOCK
		fi

	done

done
#!/usr/bin/bash
#
# ./crash-test-with-retries-for-schema.sh OUTPUT_DIRECTORY NUMBER_OF_TABLES REFRESH_INTERVAL SLEEP_SECONDS NUM_RETRIES MAX_WAIT
#
# OUTPUT_DIRECTORY - directory where all the data directories / logs will be
# NUMBER_OF_TABLES - number of tables to create, add to publication, ...
# REFRESH_INTERVAL - call REFRESH PUBLICATION after adding this many tables
# SLEEP_SECONDS    - number of seconds to sleep between steps
# NUM_RETRIES      - max number of attempts for ALTER PUBLICATION ... ADD TABLES IN SCHEMA
# MAX_WAIT         - upper bound (seconds) on the random sleep between retries
#
#
# What this does (briefly)
#
# - initialize and start two clusters - publisher and subscriber
# - create publication/subscription between them
# - start pgbench in the background, to generate changes
# - create given number of tables, add the whole schema to the publication
#   (retrying the ALTER PUBLICATION on deadlock)
# - refresh the subscription to sync the tables
# - stop the pgbench
# - wait for the subscriber to fully catch up
# - compare data on publisher/subscriber (simple cross-check)
# - if everything went OK, remove the data directories etc.
# - if there was error, keep the data (but compress, to save space)
#

OUTDIR=$1
NUMTABLES=$2
REFRESH=$3
SLEEP=$4
NUMRETRIES=$5
MAXWAIT=$6

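# log the PATH, to make it clear which postgres binaries are being used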
echo $PATH

START_TIMESTAMP=`date +%s`

echo [`date`] [`date +%s`] "NUMTABLES=$NUMTABLES  SLEEP=$SLEEP  REFRESH=$REFRESH"

# config
PUB_PORT=5001
PUB_DATA=$OUTDIR/data-pub

SUB_PORT=5002
SUB_DATA=$OUTDIR/data-sub

# initialize clusters
pg_ctl -D $PUB_DATA init
pg_ctl -D $SUB_DATA init

echo 'wal_level = logical' >> $PUB_DATA/postgresql.conf
echo "port = $PUB_PORT" >> $PUB_DATA/postgresql.conf
echo "log_line_prefix = '%n %m [%p] [%b] [%a] [%v/%x] '" >> $PUB_DATA/postgresql.conf

echo "port = $SUB_PORT" >> $SUB_DATA/postgresql.conf
echo "log_line_prefix = '%n %m [%p] [%b] [%a] [%v/%x] '" >> $SUB_DATA/postgresql.conf

pg_ctl -D $PUB_DATA -w -l $OUTDIR/pg-pub.log start
pg_ctl -D $SUB_DATA -w -l $OUTDIR/pg-sub.log start

createdb -p $PUB_PORT test > $OUTDIR/init-pub.log 2>&1
createdb -p $SUB_PORT test > $OUTDIR/init-sub.log 2>&1

# create table on both ends
for t in `seq 1 $NUMTABLES`; do
	psql -p $PUB_PORT test -c "CREATE TABLE test_${t} (id SERIAL PRIMARY KEY, ts timestamptz, lsn pg_lsn, val TEXT)"
	psql -p $SUB_PORT test -c "CREATE TABLE test_${t} (id SERIAL PRIMARY KEY, ts timestamptz, lsn pg_lsn, val TEXT)"
done

# generate insert script for pgbench
echo > insert.sql
for t in `seq 1 $NUMTABLES`; do
	echo "INSERT INTO test_${t} (ts, lsn, val) VALUES (clock_timestamp(), pg_current_wal_insert_lsn(), md5(random()::text));" >> insert.sql
done

sort -R insert.sql > insert.tmp && mv insert.tmp insert.sql

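# wrap the (randomly ordered) inserts in a single transaction, so every
# pgbench transaction acquires its table locks in that same random order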
echo "BEGIN;" > insert.tmp
cat insert.sql >> insert.tmp
echo "COMMIT;" >> insert.tmp
mv insert.tmp insert.sql

mv insert.sql $OUTDIR/

# run pgbench on the publisher in the background (the 1h duration is overkill, we will terminate it)
pgbench -n -c 8 -f $OUTDIR/insert.sql -p $PUB_PORT -T 3600 test > $OUTDIR/pgbench.log 2>&1 &

# create publication on the publisher
echo [`date`] [`date +%s`] "creating publication"
psql -p $PUB_PORT test -c "CREATE PUBLICATION p"

# sleep for a few seconds
echo [`date`] [`date +%s`] "sleeping"
sleep $SLEEP

# create subscription on the subscriber
echo [`date`] [`date +%s`] "creating subscription"
psql -p $SUB_PORT test -c "CREATE SUBSCRIPTION s CONNECTION 'application_name=test host=localhost user=nitinmotiani dbname=test port=$PUB_PORT' PUBLICATION p"

# sleep for a few seconds
echo [`date`] [`date +%s`] "sleeping"
sleep $SLEEP

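# add all tables in the schema to the publication, retrying with a
# randomized backoff in case of deadlock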
num_retries=$NUMRETRIES
curr=0
until [ "$curr" -ge "$num_retries" ]
do
	echo "Attempt number : $curr"
	psql -p $PUB_PORT test -c "ALTER PUBLICATION p ADD TABLES IN SCHEMA public" && break
	curr=$((curr+1))
	sleep $((RANDOM%$MAXWAIT))
done
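# if every attempt failed, record it so we report DEADLOCK instead of
# running the consistency check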
deadlock=0
if [ "$curr" == "$num_retries" ]; then
	deadlock=1
fi
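# walk through the tables and refresh the subscription once enough tables
# have been seen (the per-table ADD TABLE from the original crash-test.sh
# is commented out below, since all tables are added via the schema above)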
refreshed=0
for t in `seq 1 $NUMTABLES`; do

	lsn=`psql -p $PUB_PORT test -t -A -c "SELECT pg_current_wal_lsn()"`
	#echo [`date`] [`date +%s`] "adding table $t to publication [$lsn]"

	#psql -p $PUB_PORT test -c "ALTER PUBLICATION p ADD TABLE test_${t}"

	m=$((t % REFRESH))

	if [ "$m" == "0" ] && [ "$refreshed" == "0" ]; then

		lsn=`psql -p $PUB_PORT test -t -A -c "SELECT pg_current_wal_lsn()"`
		echo [`date`] [`date +%s`] "refreshing publication [$lsn] in loop"

		psql -p $SUB_PORT test -c "ALTER SUBSCRIPTION s REFRESH PUBLICATION"
		psql -p $PUB_PORT test -c "SELECT pg_current_wal_lsn() AS wal_lsn, pg_current_wal_insert_lsn() AS insert_lsn, (SELECT relname FROM pg_class AS c WHERE c.oid = p.prrelid) AS relname, * FROM pg_publication_rel p ORDER BY prrelid" >> $OUTDIR/pg_publication_rel.log 2>&1
		refreshed=1
	fi

done

if [ "$refreshed" == "0" ]; then
	lsn=`psql -p $PUB_PORT test -t -A -c "SELECT pg_current_wal_lsn()"`
	echo [`date`] [`date +%s`] "refreshing publication [$lsn]"
	psql -p $SUB_PORT test -c "ALTER SUBSCRIPTION s REFRESH PUBLICATION"
fi

psql -p $PUB_PORT test -c "SELECT pg_current_wal_lsn() AS wal_lsn, pg_current_wal_insert_lsn() AS insert_lsn, (SELECT relname FROM pg_class AS c WHERE c.oid = p.prrelid) AS relname, * FROM pg_publication_rel p ORDER BY prrelid" >> $OUTDIR/pg_publication_rel.log 2>&1

# wait for all tables to get in sync (essentially what wait_for_subscription_sync does)
test_total=`psql -p $SUB_PORT test -t -A -c "SELECT count(*) FROM pg_subscription_rel"`
echo "Total num entries in the rel table are $test_total"
while /bin/true; do
	echo [`date`] [`date +%s`] "waiting for subscription"
	c=`psql -p $SUB_PORT test -t -A -c "SELECT count(*) FROM pg_subscription_rel WHERE srsubstate = 'r'"`
	if [ "$c" == "$test_total" ]; then
		break
	fi
	echo [`date`] [`date +%s`] "synced $c tables out of $test_total"
	sleep 1
done

# sleep for a few more seconds, to generate more pgbench changes
echo [`date`] [`date +%s`] "sleeping"
sleep $SLEEP

# kill the pgbench, and wait for all the associated backends to terminate
killall pgbench

while /bin/true; do
	echo [`date`] [`date +%s`] "waiting for pgbench backends to die"
	r=`psql -p $PUB_PORT test -t -A -c "SELECT count(*) FROM pg_stat_activity WHERE application_name = 'pgbench'"`
	if [ "$r" == "0" ]; then
		break
	fi
	sleep 1
done

# get current LSN on publisher and wait for subscriber to fully catchup (same as wait_for_catchup)
lsn=`psql -p $PUB_PORT test -t -A -c "SELECT pg_current_wal_lsn()"`

echo [`date`] [`date +%s`] "waiting for replay of LSN $lsn"

start_lsn=`psql -p $PUB_PORT test -t -A -c "SELECT replay_lsn FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
start_time=`date +%s`

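# wait for the subscriber to fully catch up; skip the wait entirely if the
# ALTER PUBLICATION never succeeded (the deadlock case)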
while /bin/true; do
	if [ "$deadlock" != 0 ]; then
		break
	fi
	echo [`date`] [`date +%s`] "waiting for catchup $lsn"
	replay_lsn=`psql -p $PUB_PORT test -t -A -c "SELECT replay_lsn FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
	remaining_wal=`psql -p $PUB_PORT test -t -A -c "SELECT ('$lsn' - replay_lsn) FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
	remaining_kb=$((remaining_wal/1024))

	current_time=`date +%s`
	delta_time=$((current_time - start_time + 1))
	delta_wal=`psql -p $PUB_PORT test -t -A -c "SELECT (replay_lsn - '$start_lsn') FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
	total_wal=`psql -p $PUB_PORT test -t -A -c "SELECT ('$lsn'::pg_lsn - '$start_lsn'::pg_lsn) FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
	speed=$((delta_wal / delta_time))

	if [ "$speed" != "0" ]; then
		remaining_sec=$((remaining_wal / speed))
	else
		remaining_sec='???'
	fi

	r=`psql -p $PUB_PORT test -t -A -c "SELECT replay_lsn >= '$lsn' AND state = 'streaming' FROM pg_catalog.pg_stat_replication WHERE application_name = 'test'"`
	if [ "$r" == "t" ]; then
		break
	fi
	echo [`date`] [`date +%s`] "replay LSN $replay_lsn, remaining $remaining_kb kB / $remaining_sec seconds"
	sleep 1
done

# dump basic info about subscription and subscriber

psql -p $SUB_PORT test -c "SELECT oid, relname FROM pg_class AS c WHERE relname like 'test_%' ORDER BY relname" > $OUTDIR/pg_class.log 2>&1

psql -p $SUB_PORT test -c "SELECT (SELECT relname FROM pg_class AS c WHERE c.oid = r.srrelid) AS relname, * FROM pg_subscription_rel AS r ORDER BY 1" > $OUTDIR/pg_subscription_rel.log 2>&1

psql -p $PUB_PORT test -c "SELECT pg_current_wal_lsn() AS wal_lsn, pg_current_wal_insert_lsn() AS insert_lsn, * FROM pg_stat_replication" > $OUTDIR/pg_stat_replication.log 2>&1

# check consistency - compare number of records, sum, ... could be made more thorough, but good enough
echo [`date`] [`date +%s`] "caught up, check consistency"
r="OK"
c=0
for t in `seq 1 $NUMTABLES`; do
	a=`psql -p $PUB_PORT test -t -A -c "SELECT count(*), sum(id) FROM test_${t}"`
	b=`psql -p $SUB_PORT test -t -A -c "SELECT count(*), sum(id) FROM test_${t}"`
	if [ "$a" == "$b" ]; then
		echo [`date`] [`date +%s`] "$t OK ($a $b)"
	else
		echo [`date`] [`date +%s`] "$t ERROR ($a $b)"
		r="ERROR"
		c=1
	fi
done

if [ "$deadlock" == "1" ]; then
	r="DEADLOCK"
	c=2
fi

# dump table contents on both ends (kept for later inspection)
for t in `seq 1 $NUMTABLES`; do
	psql -p $PUB_PORT test -c "SELECT * FROM test_${t} ORDER BY id" > $OUTDIR/test_${t}_pub.data
	psql -p $SUB_PORT test -c "SELECT * FROM test_${t} ORDER BY id" > $OUTDIR/test_${t}_sub.data
done

pg_ctl -D $PUB_DATA -w -m immediate stop
pg_ctl -D $SUB_DATA -w -m immediate stop

# if everything went OK, remove the data directories
if [ "$r" == "OK" ]; then
	rm -Rf $PUB_DATA
	rm -Rf $SUB_DATA
	# remove the log files too, they may be pretty large due to debugging
	rm $OUTDIR/pg-pub.log $OUTDIR/pg-sub.log
else
	# for error, compress the data directories
	tar -c $PUB_DATA | gzip --fast > $PUB_DATA.tgz
	tar -c $SUB_DATA | gzip --fast > $SUB_DATA.tgz
	# and compress the log files as well
	pigz --fast $OUTDIR/pg-pub.log $OUTDIR/pg-sub.log
fi

date

END_TIMESTAMP=`date +%s`

echo [`date`] [`date +%s`] "RUNTIME: " $((END_TIMESTAMP - START_TIMESTAMP))
echo [`date`] [`date +%s`] "RESULT: $r"

exit $c
