On Sat, Jun 1, 2019 at 8:25 PM Tom K <tomk...@gmail.com> wrote:

> So the best bet will be trying to get through this error then:
>
> [ PSQL02 ]
> PANIC: replication checkpoint has wrong magic 0 instead of 307747550
>
>
> On Sat, Jun 1, 2019 at 8:21 PM Tom K <tomk...@gmail.com> wrote:
>
>> On Sat, Jun 1, 2019 at 7:34 PM Adrian Klaver <adrian.kla...@aklaver.com> wrote:
>>
>>> On 6/1/19 4:22 PM, Tom K wrote:
>>>
>>> > Looks like this crash was far more catastrophic than I thought. By the
>>> > looks of things, I'm thinking psql02 would be my best bet.
>>>
>>> The more I look at it the more I think the replication was not doing
>>> what you thought it was doing. That psql02 was the primary and that
>>> psql01 and psql03 were out of sync and/or defunct standbys.
>>
>> Now that I look at the files myself, that's the conclusion I was coming
>> to as well. Sample config:
>>
>> [root@psql02 base]# cat /etc/patroni.yml
>> scope: postgres
>> namespace: /db/
>> name: postgresql1
>>
>> restapi:
>>   listen: 192.168.0.124:8008
>>   connect_address: 192.168.0.124:8008
>>
>> etcd:
>>   host: 192.168.0.124:2379
>>
>> bootstrap:
>>   dcs:
>>     ttl: 30
>>     loop_wait: 10
>>     retry_timeout: 10
>>     maximum_lag_on_failover: 1048576
>>     postgresql:
>>       use_pg_rewind: true
>>
>>   initdb:
>>   - encoding: UTF8
>>   - data-checksums
>>
>>   pg_hba:
>>   - host replication replicator 127.0.0.1/32 md5
>>   - host replication replicator 192.168.0.108/0 md5
>>   - host replication replicator 192.168.0.124/0 md5
>>   - host replication replicator 192.168.0.118/0 md5
>>   - host all all 0.0.0.0/0 md5
>>
>>   users:
>>     admin:
>>       password: admin
>>       options:
>>         - createrole
>>         - createdb
>>
>> postgresql:
>>   listen: 192.168.0.124:5432
>>   bin_dir: /usr/pgsql-10/bin
>>   connect_address: 192.168.0.124:5432
>>   data_dir: /data/patroni
>>   pgpass: /tmp/pgpass
>>   unix_socket_directories: /data/patroni
>>   authentication:
>>     replication:
>>       username: replicator
>>       password: rep-pass
>>     superuser:
>>       username: postgres
>>       password: <SECRET>
>>   parameters:
>>     unix_socket_directories: '.'
>>
>> tags:
>>   nofailover: false
>>   noloadbalance: false
>>   clonefrom: false
>>   nosync: false
>> [root@psql02 base]#
>>
>> Or perhaps when the system crashed, the filesystem check simply moved the
>> folders out due to corruption.

Trying what we did above but on the second node:
2019-06-01 20:28:45.305 EDT [21745] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:28:46.142 EDT [21745] DEBUG: primary_slot_name = 'postgresql1'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: standby_mode = 'on'
2019-06-01 20:28:46.142 EDT [21745] DEBUG: recovery_target_timeline = latest
2019-06-01 20:28:46.142 EDT [21745] WARNING: recovery command file "recovery.conf" specified neither primary_conninfo nor restore_command
2019-06-01 20:28:46.142 EDT [21745] HINT: The database server will regularly poll the pg_wal subdirectory to check for files placed there.
2019-06-01 20:28:46.142 EDT [21745] LOG: entering standby mode
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:28:46.142 EDT [21745] LOG: invalid primary checkpoint record
2019-06-01 20:28:46.142 EDT [21745] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:28:46.142 EDT [21745] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:28:46.143 EDT [21745] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:28:46.143 EDT [21745] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication slots
2019-06-01 20:28:46.143 EDT [21745] DEBUG: starting up replication origin progress state
2019-06-01 20:28:46.143 EDT [21745] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:28:46.143 EDT [21743] DEBUG: reaping dead processes
2019-06-01 20:28:46.143 EDT [21743] LOG: startup process (PID 21745) was terminated by signal 6: Aborted
2019-06-01 20:28:46.143 EDT [21743] LOG: aborting startup due to startup process failure
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:28:46.144 EDT [21743] DEBUG: cleaning up dynamic shared memory control segment with ID 222930115
2019-06-01 20:28:46.145 EDT [21743] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:28:46.146 EDT [21743] LOG: database system is shut down
2019-06-01 20:28:46.146 EDT [21743] DEBUG: exit(1)
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.146 EDT [21743] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: logger shutting down
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: exit(0)
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:28:46.147 EDT [21744] DEBUG: proc_exit(-1): 0 callbacks to make
^C
-bash-4.2$

Removing the recovery.conf and restarting yields the following messages:

2019-06-01 20:31:12.231 EDT [21910] LOG: database system was interrupted; last known up at 2019-04-28 06:06:24 EDT
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid record length at 0/4C35CDF8: wanted 24, got 0
2019-06-01 20:31:12.505 EDT [21910] LOG: invalid primary checkpoint record
2019-06-01 20:31:12.505 EDT [21910] LOG: using previous checkpoint record at 0/4C34EDA8
2019-06-01 20:31:12.505 EDT [21910] DEBUG: redo record is at 0/4C34ED70; shutdown FALSE
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next transaction ID: 0:1409831; next OID: 237578
2019-06-01 20:31:12.505 EDT [21910] DEBUG: next MultiXactId: 48; next MultiXactOffset: 174
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest unfrozen transaction ID: 549, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: oldest MultiXactId: 1, in database 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: commit timestamp Xid oldest/newest: 0/0
2019-06-01 20:31:12.505 EDT [21910] DEBUG: transaction ID wrap limit is 2147484196, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication slots
2019-06-01 20:31:12.505 EDT [21910] DEBUG: starting up replication origin progress state
2019-06-01 20:31:12.505 EDT [21910] PANIC: replication checkpoint has wrong magic 0 instead of 307747550
2019-06-01 20:31:12.506 EDT [21908] DEBUG: reaping dead processes
2019-06-01 20:31:12.506 EDT [21908] LOG: startup process (PID 21910) was terminated by signal 6: Aborted
2019-06-01 20:31:12.506 EDT [21908] LOG: aborting startup due to startup process failure
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: shmem_exit(1): 5 on_shmem_exit callbacks to make
2019-06-01 20:31:12.506 EDT [21908] DEBUG: cleaning up dynamic shared memory control segment with ID 976986759
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(1): 2 callbacks to make
2019-06-01 20:31:12.508 EDT [21908] LOG: database system is shut down
2019-06-01 20:31:12.508 EDT [21908] DEBUG: exit(1)
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.508 EDT [21908] DEBUG: proc_exit(-1): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: logger shutting down
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(0): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(0): 0 callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: exit(0)
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 before_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: shmem_exit(-1): 0 on_shmem_exit callbacks to make
2019-06-01 20:31:12.510 EDT [21909] DEBUG: proc_exit(-1): 0 callbacks to make

>>> --
>>> Adrian Klaver
>>> adrian.kla...@aklaver.com
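
For reference, the expected magic 307747550 is PostgreSQL's REPLICATION_STATE_MAGIC, which the startup process reads from pg_logical/replorigin_checkpoint inside the data directory. A reported magic of 0 typically means the file is still there but its header bytes are zero, which fits a crash followed by a filesystem check clobbering it. A quick way to confirm (assuming the data_dir /data/patroni from the patroni.yml above):

-bash-4.2$ ls -l /data/patroni/pg_logical/replorigin_checkpoint
-bash-4.2$ od -A d -t x1 /data/patroni/pg_logical/replorigin_checkpoint
# a healthy file starts with the magic bytes de da 57 12 (0x1257DADE = 307747550);
# an all-zero dump would confirm the file was wiped
-bash-4.2$ /usr/pgsql-10/bin/pg_controldata /data/patroni
# shows the latest/prior checkpoint locations behind the
# "invalid primary checkpoint record" messages above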
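
If that file really is zero-filled, one possible next step is to move it aside and let the server rebuild it at the next checkpoint; replorigin_checkpoint only records replication-origin progress for logical replication, which a streaming-replication Patroni setup like the one above normally doesn't use. A rough sketch only (service name and target paths are assumptions, and only after taking a full copy of the data directory):

[root@psql02 ~]# systemctl stop patroni   # assumed service name; keep Patroni from auto-restarting the node
[root@psql02 ~]# cp -a /data/patroni /data/patroni.bak
[root@psql02 ~]# mv /data/patroni/pg_logical/replorigin_checkpoint /root/replorigin_checkpoint.broken
-bash-4.2$ /usr/pgsql-10/bin/pg_ctl -D /data/patroni start   # as the postgres user

If startup then gets past the PANIC, recovery should continue from the previous checkpoint at 0/4C34EDA8, as the logs above already indicate, and the node can be re-checked with pg_controldata before handing it back to Patroni.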