Hello Mariel,

1) Have you tried running "CHECKPOINT;" on the standby to perform a restartpoint explicitly? It is possible: when executed during recovery, CHECKPOINT forces a restartpoint rather than writing a new checkpoint.
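A minimal sketch of that approach (keep in mind that a restartpoint can only be created at a checkpoint record the standby has already replayed, so on a freshly cloned standby you may first need to generate one on the primary):

    -- on the primary: write a checkpoint record into the WAL stream
    CHECKPOINT;

    -- on the standby, once that record has been replayed:
    -- during recovery, CHECKPOINT forces a restartpoint
    CHECKPOINT;

You should then see the "restartpoint starting: ..." and "restartpoint complete: ..." messages in the standby log, like the ones quoted further down in this thread.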
2) If we are talking about streaming replication, do you use replication slots? If so, have you checked pg_replication_slots and pg_stat_replication? They can help to troubleshoot streaming replication delays: see the columns sent_location, write_location, flush_location, and replay_location in pg_stat_replication, and restart_lsn in pg_replication_slots. If replay is lagging, it will show up there. (On the "proof of replay" question below, see also the P.S. at the bottom of this mail, under the quoted thread.)
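For example, on the primary, something like this (a sketch using the 9.6 names; the *_location columns and pg_xlog_* functions were renamed to *_lsn / pg_wal_* in PostgreSQL 10):

    -- how far the standby has received, flushed and replayed WAL
    SELECT application_name, state,
           sent_location, write_location,
           flush_location, replay_location,
           pg_xlog_location_diff(sent_location, replay_location)
               AS replay_lag_bytes
    FROM pg_stat_replication;

    -- how much WAL each slot is holding back
    SELECT slot_name, active, restart_lsn
    FROM pg_replication_slots;

If replay_lag_bytes keeps growing while write_location and flush_location keep up, the bottleneck is replay on the standby rather than the network or the archive.

On Wed, May 29, 2019 at 11:39 Mariel Cherkassky <mariel.cherkas...@gmail.com> wrote:

> Are there any messages that indicate that the secondary replayed a
> specific WAL? "restored 00000..." means that the restore_command
> succeeded, but there isn't any proof that it replayed the WAL.
>
> My theory regarding the issue:
> It seems that my customer stopped the db 20 minutes after the clone had
> finished. During those 20 minutes the secondary didn't get enough WAL
> records (6 WAL files), so it didn't reach max_wal_size. My
> checkpoint_timeout is set to 30 minutes, therefore there wasn't any
> checkpoint. As a result, the secondary didn't reach a restart point.
> Does that sound reasonable?
>
> So basically, if I clone a small primary db, the secondary would reach a
> restart point only if it reached a checkpoint (checkpoint_timeout or
> max_wal_size). However, I have cloned many small dbs and saw that it
> takes a second to start the secondary (which means that a restartpoint
> was reached). So what am I missing?
>
> On Wed, May 29, 2019 at 11:20 Fabio Pardi <f.pa...@portavita.eu> wrote:
>
>> On 5/29/19 9:20 AM, Mariel Cherkassky wrote:
>>> First of all thanks Fabio.
>>> I think that I'm missing something:
>>> In the next question I'm not talking about streaming replication,
>>> rather about recovery:
>>>
>>> 1. When the secondary gets the WALs from the primary, it tries to
>>> replay them, correct?
>>
>> correct
>>
>>> 2. By replaying, it just goes over the WAL records and runs them on
>>> the secondary?
>>
>> correct
>>
>>> 3. Are all those changes saved in the shared_buffers (secondary), or
>>> are the changes immediately applied to the data file blocks?
>>
>> the changes are not saved to your datafile yet. That happens at
>> checkpoint time.
>>
>>> 4. The secondary will need a checkpoint in order to flush those
>>> changes to the data files and in order to reach a restart point?
>>
>> yes
>>
>>> So, basically, if I had a checkpoint during the clone, the secondary
>>> should also have a checkpoint when I recover the secondary, right?
>>
>> correct. Even after being in sync with the master, if you restart
>> Postgres on the standby, it will then re-apply the WAL files from the
>> last checkpoint.
>>
>> In the logfile of the standby, you will see as many "restored log file"
>> messages as WAL files were produced since the last checkpoint.
>>
>> Hope it helps to clarify.
>>
>> regards,
>>
>> fabio pardi
>>
>>> On Tue, May 28, 2019 at 13:54 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>>
>>>> Hi Mariel,
>>>>
>>>> please keep the list posted. When you reply, use 'reply all'. That
>>>> will maybe help others in the community and you might also get more
>>>> help from others.
>>>>
>>>> answers are inline below
>>>>
>>>> On 28/05/2019 10:54, Mariel Cherkassky wrote:
>>>>> I have pg 9.6, repmgr version 4.3.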
>>>>> I see in the logs that WAL files are restored:
>>>>> 2019-05-22 12:35:12 EEST 60942 LOG: restored log file
>>>>> "000000010000377B000000DB" from archive
>>>>> that means that the restore_command worked, right?
>>>>
>>>> right
>>>>
>>>>> According to the docs:
>>>>> "In standby mode, a restartpoint is also triggered if
>>>>> checkpoint_segments log segments have been replayed since last
>>>>> restartpoint and at least one checkpoint record has been replayed.
>>>>> Restartpoints can't be performed more frequently than checkpoints in
>>>>> the master because restartpoints can only be performed at checkpoint
>>>>> records"
>>>>> so maybe I should decrease max_wal_size or even checkpoint_timeout
>>>>> to force a restartpoint?
>>>>> During this gap (standby clone) 6-7 WALs were generated on the
>>>>> primary.
>>>>
>>>> From what you posted earlier, you should in any case have hit a
>>>> checkpoint every 30 minutes. (That was also the assumption in the
>>>> previous messages. If that's not happening, then I would really
>>>> investigate.)
>>>>
>>>> That is, if during your cloning only a few WAL files were generated,
>>>> then it is not enough to trigger a checkpoint and you fall back to
>>>> the 30 minutes.
>>>>
>>>> I would not be bothered if I were you, but you can always force a
>>>> checkpoint on the master issuing:
>>>>
>>>> CHECKPOINT;
>>>>
>>>> at that stage, on the standby logs you will see the messages:
>>>>
>>>> restartpoint starting: ..
>>>>
>>>> restartpoint complete: ..
>>>>
>>>> regards,
>>>>
>>>> fabio pardi
>>>>
>>>>> On Mon, May 27, 2019 at 14:04 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>>>>
>>>>>> If you did not even see these messages in your standby logs:
>>>>>>
>>>>>> restartpoint starting: xlog
>>>>>>
>>>>>> then it means that the checkpoint was never even started.
>>>>>>
>>>>>> In that case, I have no clue.
>>>>>>
>>>>>> Try to describe step by step how to reproduce the problem together
>>>>>> with your setup and the version number of Postgres and repmgr, and
>>>>>> I might be able to help you further.
>>>>>>
>>>>>> regards,
>>>>>>
>>>>>> fabio pardi
>>>>>>
>>>>>> On 5/27/19 12:17 PM, Mariel Cherkassky wrote:
>>>>>>> standby_mode = 'on'
>>>>>>> primary_conninfo = 'host=X.X.X.X user=repmgr connect_timeout=10 '
>>>>>>> recovery_target_timeline = 'latest'
>>>>>>> primary_slot_name = repmgr_slot_1
>>>>>>> restore_command = 'rsync -avzhe ssh
>>>>>>> postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ;
>>>>>>> gunzip < /var/lib/pgsql/archive/%f > %p'
>>>>>>> archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup
>>>>>>> /var/lib/pgsql/archive %r'
>>>>>>>
>>>>>>> On Mon, May 27, 2019 at 12:29 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>>>>>>
>>>>>>>> Hi Mariel,
>>>>>>>>
>>>>>>>> let's keep the list in cc...
>>>>>>>>
>>>>>>>> settings look ok.
>>>>>>>>
>>>>>>>> what's in the recovery.conf file then?
>>>>>>>>
>>>>>>>> regards,
>>>>>>>>
>>>>>>>> fabio pardi
>>>>>>>>
>>>>>>>> On 5/27/19 11:23 AM, Mariel Cherkassky wrote:
>>>>>>>>> Hey,
>>>>>>>>> the configuration is the same as in the primary:
>>>>>>>>> max_wal_size = 2GB
>>>>>>>>> min_wal_size = 1GB
>>>>>>>>> wal_buffers = 16MB
>>>>>>>>> checkpoint_completion_target = 0.9
>>>>>>>>> checkpoint_timeout = 30min
>>>>>>>>>
>>>>>>>>> Regarding your question, I didn't see this message ("consistent
>>>>>>>>> recovery state reached at"), I guess that's why the secondary
>>>>>>>>> isn't available yet..
>>>>>>>>>
>>>>>>>>> Maybe I'm wrong, but what I understood from the documentation: a
>>>>>>>>> restart point is generated only after the secondary had a
>>>>>>>>> checkpoint, which means only after 30 minutes or after
>>>>>>>>> max_wal_size is reached? But still, why won't the secondary
>>>>>>>>> reach a consistent recovery state (does it require a restart
>>>>>>>>> point to be generated?)
>>>>>>>>>
>>>>>>>>> On Mon, May 27, 2019 at 12:12 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Mariel,
>>>>>>>>>>
>>>>>>>>>> if I'm not wrong, on the secondary you will see the messages
>>>>>>>>>> you mentioned when a checkpoint happens.
>>>>>>>>>>
>>>>>>>>>> What are checkpoint_timeout and max_wal_size on your standby?
>>>>>>>>>>
>>>>>>>>>> Did you ever see this in your standby log?
>>>>>>>>>>
>>>>>>>>>> "consistent recovery state reached at .."
>>>>>>>>>>
>>>>>>>>>> Maybe you can post the whole configuration of your standby for
>>>>>>>>>> easier debug.
>>>>>>>>>>
>>>>>>>>>> regards,
>>>>>>>>>>
>>>>>>>>>> fabio pardi
>>>>>>>>>>
>>>>>>>>>> On 5/27/19 10:49 AM, Mariel Cherkassky wrote:
>>>>>>>>>>> Hey,
>>>>>>>>>>> PG 9.6, I have a standalone configured. I tried to start up a
>>>>>>>>>>> secondary and ran standby clone (repmgr). The clone process
>>>>>>>>>>> took 3 hours, and during that time WALs were generated (mostly
>>>>>>>>>>> because of the checkpoint_timeout).
>>>>>>>>>>> As a result, when I start the secondary I see that it keeps
>>>>>>>>>>> getting the WALs, but I don't see any messages that indicate
>>>>>>>>>>> that the secondary tried to replay the WALs.
>>>>>>>>>>> Messages that I see:
>>>>>>>>>>> receiving incremental file list
>>>>>>>>>>> 000000010000377B000000DE
>>>>>>>>>>>
>>>>>>>>>>> sent 30 bytes received 4.11M bytes 8.22M bytes/sec
>>>>>>>>>>> total size is 4.15M speedup is 1.01
>>>>>>>>>>> 2019-05-22 12:48:10 EEST 60942 LOG: restored log file
>>>>>>>>>>> "000000010000377B000000DE" from archive
>>>>>>>>>>> 2019-05-22 12:48:11 EEST db63311 FATAL: the database system is
>>>>>>>>>>> starting up
>>>>>>>>>>> 2019-05-22 12:48:12 EEST db63313 FATAL: the database system is
>>>>>>>>>>> starting up
>>>>>>>>>>>
>>>>>>>>>>> I was hoping to see the following messages (taken from a
>>>>>>>>>>> different machine):
>>>>>>>>>>> 2019-05-27 01:15:37 EDT 7428 LOG: restartpoint starting: time
>>>>>>>>>>> 2019-05-27 01:16:18 EDT 7428 LOG: restartpoint complete: wrote
>>>>>>>>>>> 406 buffers (0.2%); 1 transaction log file(s) added, 0 removed,
>>>>>>>>>>> 0 recycled; write=41.390 s, sync=0.001 s, total=41.582 s;
>>>>>>>>>>> sync files=128, longest=0.000 s, average=0.000 s;
>>>>>>>>>>> distance=2005 kB, estimate=2699 kB
>>>>>>>>>>> 2019-05-27 01:16:18 EDT 7428 LOG: recovery restart point at
>>>>>>>>>>> 4/D096C4F8
>>>>>>>>>>>
>>>>>>>>>>> My primary settings (WAL settings):
>>>>>>>>>>> wal_buffers = 16MB
>>>>>>>>>>> checkpoint_completion_target = 0.9
>>>>>>>>>>> checkpoint_timeout = 30min
>>>>>>>>>>>
>>>>>>>>>>> Any idea what can explain why the secondary doesn't replay the
>>>>>>>>>>> WALs?
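P.S. Regarding the question at the top of the quoted thread, about proof that a WAL file was replayed: by default the standby does not log a per-file "replayed" message, but you can query its replay position directly. A minimal sketch, run on the standby (9.6 function names; they were renamed in PostgreSQL 10):

    -- current recovery progress on the standby
    SELECT pg_is_in_recovery()              AS in_recovery,
           pg_last_xlog_receive_location()  AS last_received,  -- only set when streaming
           pg_last_xlog_replay_location()   AS last_replayed,
           pg_last_xact_replay_timestamp()  AS last_replayed_commit;

With the default 16 MB segment size, each WAL file covers a fixed LSN range (for example "000000010000377B000000DB" covers 377B/DB000000 up to 377B/DBFFFFFF), so once last_replayed has moved past that range, the restored file has been replayed.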