Thanks for all. We found the source of the problem. It was mistake in upgrade to 9.6.
We upgrade replica with rsync as it is in the documentation: rsync --verbose --relative --archive --hard-links --size-only old_pgdata new_pgdata remote_dir We must provide 100% read-only availability of our shard (master + 2 replicas). So we can’t stop master and both replicas, upgrade them one by one and start them. We do it as follows: Close master from load, stop master, upgrade it, stop 1st replica, upgrade it, start 1st replica, stop 2nd replica, upgrade it, start 2nd replica, start master, open master. But upgraded replicas died under load without statistics and we decided to perform analyze on master before upgrading replicas. In this case statistics would be copied to replicas by rsync. The upgrade algorithm became as follows: Close master, stop master, close master from replicas (iptables), upgrade master, start master, perform analyze, stop master, stop 1st replica, upgrade 1st replica, start 1st replica, stop 2nd replica, upgrade 2nd replica, start 2nd replica, start master, open master. If autovacuum starts vacuuming relations while we are performing analyze, wal records generated by it will not be replayed on replicas, because next step is stopping master with checkpoint and new redo location LSN (newer that these wal records) will appear in pg_control file, which then will be copied by rsync to replicas. If it was simple vacuum, we most likely will not see the consequences. Because it marks tuples as deleted, and some of the next new tuples will be placed here, and due to FPW replicas will receive correct page, identical to master. But if it was vacuum to prevent wraparound, we will see situation like ours. Tuples on master will be frozen, but on replicas not. And it will not change if nobody will not update any tuple on this page. It’s dangerous, because, if we perform switchover to replica, «corrupted» page will be delivered to all replicas after next update of any tuple from this page. We reproduced this case in our test environment and this assumption was confirmed. Starting and stopping master after running pg_upgrade but before rsync to collect statistics was a bad idea. We know how to find such «corrupted» tuples. And we want to fix this by manually freezing tuples via calling specially written C functions. Functions are «copy-pasted» and simplified code from vacuum functions with SQL interface (see attachment). Can you look on them? Do you think it is safe to use them for fixing corrupted pages or is there a better way not to loose data? Regards, Dmitriy Sarafannikov
Description: Binary data
-- Sent via pgsql-hackers mailing list (email@example.com) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers