Thanks, everyone.

We found the source of the problem: it was a mistake in our upgrade to 9.6.

We upgraded the replicas with rsync, as described in the documentation:

rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir
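Presumably (our real layout differs) this is run from the directory above the
old and new data directories, so that --relative reproduces the same paths on
the remote side; with hypothetical paths:

cd /var/lib/pgsql   # parent of both cluster directories
rsync --verbose --relative --archive --hard-links --size-only \
    9.5/data 9.6/data replica1:/var/lib/pgsql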

We must provide 100% read-only availability of our shard (master + 2 replicas),
so we can't simply stop the master and both replicas, upgrade them one by one,
and start them all again.
Instead we do it as follows: close the master from load, stop the master,
upgrade it; stop the 1st replica, upgrade it, start it; stop the 2nd replica,
upgrade it, start it; then start the master and open it for load. A rough
sketch of this sequence follows.
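A minimal sketch, assuming placeholder hosts, paths and binary locations
(ours differ):

# on the master, after closing it from load:
pg_ctl -D /var/lib/pgsql/9.5/data stop -m fast
pg_upgrade -b /usr/pgsql-9.5/bin -B /usr/pgsql-9.6/bin \
           -d /var/lib/pgsql/9.5/data -D /var/lib/pgsql/9.6/data --link
# for each replica in turn: stop it, rsync from the master (as above), start it:
ssh replica1 'pg_ctl -D /var/lib/pgsql/9.5/data stop -m fast'
cd /var/lib/pgsql && rsync --verbose --relative --archive --hard-links \
    --size-only 9.5/data 9.6/data replica1:/var/lib/pgsql
ssh replica1 'pg_ctl -D /var/lib/pgsql/9.6/data start'
# finally, start the master and open it for load:
pg_ctl -D /var/lib/pgsql/9.6/data start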
But the upgraded replicas died under load for lack of statistics, so we decided
to run ANALYZE on the master before upgrading the replicas; that way the
statistics would be copied to the replicas by rsync.
The upgrade algorithm became: close the master from load, stop the master,
close the master off from the replicas (iptables), upgrade the master; start
the master, run ANALYZE, stop the master; stop the 1st replica, upgrade it,
start it; stop the 2nd replica, upgrade it, start it; start the master and
open it for load. The extra steps are sketched below.
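The additions relative to the first procedure would look roughly like this
(the iptables rules and port are illustrative, and vacuumdb stands in for
however the ANALYZE is actually run):

# after stopping the old master, block replication connections so the
# replicas cannot stream the WAL generated while statistics are collected:
iptables -A INPUT -p tcp --dport 5432 -s replica1 -j DROP
iptables -A INPUT -p tcp --dport 5432 -s replica2 -j DROP
# upgrade the master as before, then:
pg_ctl -D /var/lib/pgsql/9.6/data start
vacuumdb --all --analyze-only          # collect statistics in every database
pg_ctl -D /var/lib/pgsql/9.6/data stop -m fast   # shutdown checkpoint here
# continue with the per-replica stop/rsync/start as before, drop the
# iptables rules, start the master and open it for load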

If autovacuum starts vacuuming relations while we are running that ANALYZE,
the WAL records it generates will never be replayed on the replicas: the next
step is stopping the master, and its shutdown checkpoint writes a new redo
location LSN (newer than those WAL records) into the pg_control file, which
rsync then copies to the replicas.
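The effect is visible with pg_controldata: after that final stop, the master's
pg_control already points past the vacuum's WAL, and rsync ships exactly this
file to the replicas (path is a placeholder):

pg_controldata /var/lib/pgsql/9.6/data | \
    grep -E 'checkpoint location|REDO location'
# Latest checkpoint location:            <LSN after the ANALYZE-phase WAL>
# Latest checkpoint's REDO location:     <LSN after the ANALYZE-phase WAL>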

If it were a plain vacuum, we would most likely never see the consequences:
it only marks tuples as deleted, some later new tuples get placed in their
space, and thanks to full-page writes the replicas receive the correct page,
identical to the master's. But if it was a vacuum to prevent wraparound, we
get a situation like ours: tuples are frozen on the master but not on the
replicas, and nothing changes as long as nobody updates any tuple on that
page.
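One way to spot such pages is the pageinspect extension: a frozen tuple has
both HEAP_XMIN_COMMITTED (0x0100) and HEAP_XMIN_INVALID (0x0200) set in
t_infomask, i.e. HEAP_XMIN_FROZEN = 0x0300. A sketch with a placeholder table
name and block number; comparing its output for the same page on the master
and on a replica should show the mismatch:

psql -d mydb <<'SQL'
CREATE EXTENSION IF NOT EXISTS pageinspect;
-- tuples on this page whose xmin is not marked frozen
SELECT lp, t_xmin, to_hex(t_infomask) AS infomask
FROM heap_page_items(get_raw_page('some_table', 123))
WHERE t_infomask IS NOT NULL
  AND (t_infomask & 768) <> 768;  -- 768 = 0x0300 = HEAP_XMIN_FROZEN
SQL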

It’s dangerous because, if we perform a switchover to a replica, the
«corrupted» page will be delivered to all replicas after the next update of
any tuple on that page.

We reproduced this case in our test environment and confirmed the assumption.

Starting and stopping the master after running pg_upgrade but before the
rsync, just to collect statistics, was a bad idea.

We know how to find such «corrupted» tuples, and we want to fix them by
manually freezing the tuples via specially written C functions. The functions
are code «copy-pasted» and simplified from the vacuum functions, with an SQL
interface (see attachment). Could you look at them? Do you think it is safe
to use them to fix the corrupted pages, or is there a better way not to lose
data?

Regards,
Dmitriy Sarafannikov

Attachment: freeze_tuple.c
