In my case, I'm talking about creating a standby -> standby clone and then recovering the WALs. How can I run a checkpoint on the secondary if it isn't up?
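Since the standby refuses connections while it is still starting up, the only lever at that point is on the primary side, as Fabio suggests further down: force a checkpoint there and let the checkpoint record ship over. A minimal sketch, assuming the primary is up and the standby is fed from the WAL archive:

    -- On the primary: write a checkpoint record into the WAL stream.
    -- The standby can only create a restartpoint when it replays such
    -- a record.
    CHECKPOINT;

    -- 9.6 spelling (renamed pg_switch_wal() in v10): close the current
    -- WAL segment so it is archived immediately instead of waiting for
    -- the 16MB segment to fill; otherwise an archive-fed standby may
    -- not see the checkpoint record for a long time.
    SELECT pg_switch_xlog();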
On Wed, May 29, 2019 at 13:54 Nikolay Samokhvalov <samokhva...@gmail.com> wrote:

> Hello Mariel,
>
> 1) Have you tried to run "CHECKPOINT;" on the standby to perform a
> restartpoint explicitly? It is possible.
>
> 2) If we are talking about streaming replication, do you use replication
> slots? If so, have you checked pg_replication_slots and
> pg_stat_replication? They can help to troubleshoot streaming replication
> delays: see the columns sent_location, write_location, flush_location,
> and replay_location in pg_stat_replication, and restart_lsn in
> pg_replication_slots. If there is a delay in replaying, it should be
> visible there.
>
> On Wed, May 29, 2019 at 11:39 Mariel Cherkassky
> <mariel.cherkas...@gmail.com> wrote:
>
>> Is there any message that indicates that the secondary replayed a
>> specific WAL? "restored 00000..." means that the restore_command
>> succeeded, but there isn't any proof that it replayed the WAL.
>>
>> My theory regarding the issue: it seems that my customer stopped the
>> db 20 minutes after the clone had finished. During those 20 minutes
>> the secondary didn't get enough WAL records (6 WAL files), so it
>> didn't reach max_wal_size. My checkpoint_timeout is set to 30 minutes,
>> therefore there wasn't any checkpoint. As a result, the secondary
>> didn't reach a restartpoint. Does that sound reasonable?
>>
>> So basically, if I clone a small primary db, the secondary would reach
>> a restartpoint only once it hits a checkpoint (checkpoint_timeout or
>> max_wal_size). However, I have cloned many small dbs and saw that it
>> takes a second to start the secondary (which means that a restartpoint
>> was reached). So what am I missing?
>>
>> On Wed, May 29, 2019 at 11:20 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>
>>> On 5/29/19 9:20 AM, Mariel Cherkassky wrote:
>>> > First of all, thanks Fabio. I think that I'm missing something. In
>>> > the next questions I'm not talking about streaming replication, but
>>> > rather about recovery:
>>> >
>>> > 1. When the secondary gets the WALs from the primary, it tries to
>>> > replay them, correct?
>>>
>>> correct
>>>
>>> > 2. By replaying, it just goes over the WAL records and runs them on
>>> > the secondary?
>>>
>>> correct
>>>
>>> > 3. Are all those changes saved in the shared buffers (secondary),
>>> > or are the changes immediately applied to the data file blocks?
>>>
>>> the changes are not saved to your datafile yet. That happens at
>>> checkpoint time.
>>>
>>> > 4. The secondary will need a checkpoint in order to flush those
>>> > changes to the data files and in order to reach a restartpoint?
>>>
>>> yes
>>>
>>> > So basically, if I had a checkpoint during the clone, the secondary
>>> > should also have a checkpoint when I recover the secondary, right?
>>>
>>> correct. Even after being in sync with the master, if you restart
>>> Postgres on the standby, it will then re-apply the WAL files from the
>>> last checkpoint.
>>>
>>> In the logfile of the standby, you will see as many "restored log
>>> file" messages as there were WAL files produced since the last
>>> checkpoint.
>>>
>>> Hope it helps to clarify.
>>>
>>> regards,
>>>
>>> fabio pardi
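For reference, the monitoring queries Nikolay suggests above, spelled out with the 9.6 column names (in v10 the *_location columns were renamed to *_lsn):

    -- On the primary: one row per connected standby. The gap between
    -- sent_location and replay_location is the replication lag in WAL
    -- positions.
    SELECT application_name, state,
           sent_location, write_location,
           flush_location, replay_location
    FROM pg_stat_replication;

    -- Replication slots: restart_lsn is the oldest WAL position the
    -- slot still retains; active = false means nothing is currently
    -- consuming the slot.
    SELECT slot_name, active, restart_lsn
    FROM pg_replication_slots;

Note that a standby fed only through restore_command, and still starting up, will not appear in pg_stat_replication; that view covers walsender connections only.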
>>> > On Tue, May 28, 2019 at 13:54 Fabio Pardi <f.pa...@portavita.eu> wrote:
>>> >
>>> > Hi Mariel,
>>> >
>>> > please keep the list posted. When you reply, use 'reply all'. That
>>> > will maybe help others in the community, and you might also get more
>>> > help from others.
>>> >
>>> > answers are in line here below
>>> >
>>> > On 28/05/2019 10:54, Mariel Cherkassky wrote:
>>> > > I have pg 9.6, repmgr version 4.3. I see in the logs that WAL
>>> > > files are restored:
>>> > > 2019-05-22 12:35:12 EEST 60942 LOG: restored log file
>>> > > "000000010000377B000000DB" from archive
>>> > > That means that the restore_command worked, right?
>>> >
>>> > right
>>> >
>>> > > According to the docs: "In standby mode, a restartpoint is also
>>> > > triggered if checkpoint_segments log segments have been replayed
>>> > > since the last restartpoint and at least one checkpoint record has
>>> > > been replayed. Restartpoints can't be performed more frequently
>>> > > than checkpoints in the master because restartpoints can only be
>>> > > performed at checkpoint records."
>>> > > So maybe I should decrease max_wal_size or even checkpoint_timeout
>>> > > to force a restartpoint? During this gap (standby clone), 6-7 WALs
>>> > > were generated on the primary.
>>> >
>>> > From what you posted earlier, you should in any case have hit a
>>> > checkpoint every 30 minutes. (That was also the assumption in the
>>> > previous messages. If that's not happening, then I would really
>>> > investigate.)
>>> >
>>> > That is, if during your cloning only a few WAL files were generated,
>>> > then it is not enough to trigger a checkpoint, and you fall back to
>>> > the 30 minutes.
>>> >
>>> > I would not be bothered if I were you, but you can always force a
>>> > checkpoint on the master by issuing:
>>> >
>>> > CHECKPOINT;
>>> >
>>> > At that stage, in the standby logs you will see the messages:
>>> >
>>> > restartpoint starting: ..
>>> > restartpoint complete: ..
>>> >
>>> > regards,
>>> >
>>> > fabio pardi
>>> >
>>> > > On Mon, May 27, 2019 at 14:04 Fabio Pardi <f.pa...@portavita.eu>
>>> > > wrote:
>>> > >
>>> > > If you did not even see this message in your standby logs:
>>> > >
>>> > > restartpoint starting: xlog
>>> > >
>>> > > then it means that the checkpoint never even started. In that
>>> > > case, I have no clue.
>>> > >
>>> > > Try to describe step by step how to reproduce the problem,
>>> > > together with your setup and the version numbers of Postgres and
>>> > > repmgr, and I might be able to help you further.
>>> > >
>>> > > regards,
>>> > >
>>> > > fabio pardi
>>> > >
>>> > > On 5/27/19 12:17 PM, Mariel Cherkassky wrote:
>>> > > > standby_mode = 'on'
>>> > > > primary_conninfo = 'host=X.X.X.X user=repmgr connect_timeout=10 '
>>> > > > recovery_target_timeline = 'latest'
>>> > > > primary_slot_name = repmgr_slot_1
>>> > > > restore_command = 'rsync -avzhe ssh postgres@x.x.x.x:/var/lib/pgsql/archive/%f /var/lib/pgsql/archive/%f ; gunzip < /var/lib/pgsql/archive/%f > %p'
>>> > > > archive_cleanup_command = '/usr/pgsql-9.6/bin/pg_archivecleanup /var/lib/pgsql/archive %r'
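A side note on this recovery.conf: since WAL reaches the standby through an archive, it is worth confirming on the primary that archiving itself keeps up, otherwise the standby has nothing to restore. A sketch against pg_stat_archiver (available since 9.4):

    -- On the primary: how many segments have been archived, the last
    -- one archived, and whether archive_command has been failing.
    SELECT archived_count, last_archived_wal, last_archived_time,
           failed_count, last_failed_wal, last_failed_time
    FROM pg_stat_archiver;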
>>> > > > On Mon, May 27, 2019 at 12:29 Fabio Pardi <f.pa...@portavita.eu>
>>> > > > wrote:
>>> > > >
>>> > > > Hi Mariel,
>>> > > >
>>> > > > let's keep the list in cc...
>>> > > >
>>> > > > settings look ok.
>>> > > >
>>> > > > what's in the recovery.conf file then?
>>> > > >
>>> > > > regards,
>>> > > >
>>> > > > fabio pardi
>>> > > >
>>> > > > On 5/27/19 11:23 AM, Mariel Cherkassky wrote:
>>> > > > > Hey,
>>> > > > > the configuration is the same as in the primary:
>>> > > > > max_wal_size = 2GB
>>> > > > > min_wal_size = 1GB
>>> > > > > wal_buffers = 16MB
>>> > > > > checkpoint_completion_target = 0.9
>>> > > > > checkpoint_timeout = 30min
>>> > > > >
>>> > > > > Regarding your question, I didn't see this message ("consistent
>>> > > > > recovery state reached at"); I guess that's why the secondary
>>> > > > > isn't available yet.
>>> > > > >
>>> > > > > Maybe I'm wrong, but what I understood from the documentation
>>> > > > > is that a restartpoint is generated only after the secondary
>>> > > > > has had a checkpoint, which means only after 30 minutes or
>>> > > > > after max_wal_size is reached? But still, why won't the
>>> > > > > secondary reach a consistent recovery state (does it require
>>> > > > > a restartpoint to be generated?)
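If I read the docs right, these are two separate things: "consistent recovery state reached" is logged once replay passes the minimum recovery point recorded in the control file (for a fresh clone, roughly the position where the base backup ended), and it does not require a restartpoint. Restartpoints only decide how far back replay must restart if the standby itself is restarted. A sketch of how to see where that point lies, using the control-data functions added in 9.6 (they need a standby that already accepts connections; while it still refuses connections, the pg_controldata binary shows the same fields):

    -- Replay must pass min_recovery_end_location (and the backup end
    -- location, if one is set) before the standby is consistent and,
    -- with hot_standby = on, starts accepting read-only connections.
    SELECT min_recovery_end_location,
           backup_end_location,
           end_of_backup_record_required
    FROM pg_control_recovery();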
>>> > > > > On Mon, May 27, 2019 at 12:12 Fabio Pardi
>>> > > > > <f.pa...@portavita.eu> wrote:
>>> > > > >
>>> > > > > Hi Mariel,
>>> > > > >
>>> > > > > if I'm not wrong, on the secondary you will see the messages
>>> > > > > you mentioned when a checkpoint happens.
>>> > > > >
>>> > > > > What are checkpoint_timeout and max_wal_size on your standby?
>>> > > > >
>>> > > > > Did you ever see this in your standby log?
>>> > > > >
>>> > > > > "consistent recovery state reached at .."
>>> > > > >
>>> > > > > Maybe you can post the whole configuration of your standby
>>> > > > > for easier debugging.
>>> > > > >
>>> > > > > regards,
>>> > > > >
>>> > > > > fabio pardi
>>> > > > >
>>> > > > > On 5/27/19 10:49 AM, Mariel Cherkassky wrote:
>>> > > > > > Hey,
>>> > > > > > PG 9.6. I have a standalone configured. I tried to start up
>>> > > > > > a secondary by running standby clone (repmgr). The clone
>>> > > > > > process took 3 hours, and during that time WALs were
>>> > > > > > generated (mostly because of the checkpoint_timeout). As a
>>> > > > > > result, when I start the secondary, I see that it keeps
>>> > > > > > getting the WALs, but I don't see any messages that indicate
>>> > > > > > that it tried to replay them.
>>> > > > > >
>>> > > > > > Messages that I see:
>>> > > > > >
>>> > > > > > receiving incremental file list
>>> > > > > > 000000010000377B000000DE
>>> > > > > >
>>> > > > > > sent 30 bytes received 4.11M bytes 8.22M bytes/sec
>>> > > > > > total size is 4.15M speedup is 1.01
>>> > > > > > 2019-05-22 12:48:10 EEST 60942 LOG: restored log file
>>> > > > > > "000000010000377B000000DE" from archive
>>> > > > > > 2019-05-22 12:48:11 EEST db63311 FATAL: the database system
>>> > > > > > is starting up
>>> > > > > > 2019-05-22 12:48:12 EEST db63313 FATAL: the database system
>>> > > > > > is starting up
>>> > > > > >
>>> > > > > > I was hoping to see the following messages (taken from a
>>> > > > > > different machine):
>>> > > > > >
>>> > > > > > 2019-05-27 01:15:37 EDT 7428 LOG: restartpoint starting: time
>>> > > > > > 2019-05-27 01:16:18 EDT 7428 LOG: restartpoint complete:
>>> > > > > > wrote 406 buffers (0.2%); 1 transaction log file(s) added,
>>> > > > > > 0 removed, 0 recycled; write=41.390 s, sync=0.001 s,
>>> > > > > > total=41.582 s; sync files=128, longest=0.000 s,
>>> > > > > > average=0.000 s; distance=2005 kB, estimate=2699 kB
>>> > > > > > 2019-05-27 01:16:18 EDT 7428 LOG: recovery restart point at
>>> > > > > > 4/D096C4F8
>>> > > > > >
>>> > > > > > My primary settings (WAL settings):
>>> > > > > > wal_buffers = 16MB
>>> > > > > > checkpoint_completion_target = 0.9
>>> > > > > > checkpoint_timeout = 30min
>>> > > > > >
>>> > > > > > Any idea what can explain why the secondary doesn't replay
>>> > > > > > the WALs?
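One closing note: in archive recovery, the startup process restores a segment, reads and applies it, and only then asks restore_command for the next one, so an ongoing stream of "restored log file ... from archive" messages is itself evidence that replay is moving forward; what is missing in the log above is only the restartpoint, which additionally needs a replayed checkpoint record. Once the standby reaches consistency and lets you in, replay progress can be confirmed directly; a sketch with the 9.6 function names (the xlog functions became pg_last_wal_* in v10):

    -- On the standby: confirm it is in recovery, and how far it has
    -- received (streaming only) and replayed WAL.
    SELECT pg_is_in_recovery(),
           pg_last_xlog_receive_location(),  -- NULL when fed from archive
           pg_last_xlog_replay_location(),
           pg_last_xact_replay_timestamp();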