Re: pg_rewind docs correction
On Fri, May 1, 2020 at 4:46 AM Michael Paquier wrote: > > On Wed, Apr 29, 2020 at 09:15:06AM +0900, Michael Paquier wrote: > > I am letting that aside for a couple of days to see if others have > > more comments, and will likely commit it after an extra lookup. > > And applied after an extra lookup. Thanks for the discussion, James. Yep. Thanks for pushing to make sure it was as correct as possible while improving it. James
Re: pg_rewind docs correction
On Wed, Apr 29, 2020 at 09:15:06AM +0900, Michael Paquier wrote: > I am letting that aside for a couple of days to see if others have > more comments, and will likely commit it after an extra lookup. And applied after an extra lookup. Thanks for the discussion, James. -- Michael signature.asc Description: PGP signature
Re: pg_rewind docs correction
On Tue, Apr 28, 2020 at 12:13:38PM -0400, James Coleman wrote: > I think is missing a word. Instead of "especially the the target" > should be "especially if the target". Thanks, fixed. > In this block: > > + Create a backup_label file to begin WAL replay at > + the checkpoint created at failover and configure the > + pg_control file with a minimum consistency LSN > + defined as the result of pg_current_wal_insert_lsn() > + when rewinding from a live source and using the last checkpoint LSN > + when rewinding from a stopped source. > + > > Perhaps change "and using the last checkpoint LSN" to "or the last > checkpoint LSN". Alternatively you could make the grammar parallel by > changing to "and defined as the last checkpoint LSN", but that seems > wordy, and the "defined as [item or item]" is already a good grammar > construction. Using your first suggestion of "or the last checkpoint LSN" sounds more natural as of this morning, so updated the patch with that. I am letting that aside for a couple of days to see if others have more comments, and will likely commit it after an extra lookup. -- Michael From 48b9e85fe9ae414a0f42881558ab2fd920ba0c60 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Wed, 29 Apr 2020 09:11:28 +0900 Subject: [PATCH v6] Improve pg_rewind explanation and warnings The pg_rewind docs currently assert that the state of the target's data directory after rewind is equivalent to the source's data directory. But that isn't quite true both because the base state is further back in time and because the target's data directory will include the current state on the source of any copied blocks. Instead using the analogy to back backups helps explain the state, as well as the pros and cons of using the utility. The How It Works section now: - Includes details about how the backup_label file is created. - Includes details about how the pg_control file is updated. - Is updated to include WAL segments and new relation files in the list of files copied wholesale from the source. Finally, document clearly the state of the cluster after the operation and also the operation sequencing dangers caused by copying configuration files from the source. --- doc/src/sgml/ref/pg_rewind.sgml | 90 + 1 file changed, 57 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 07c49e4719..391e5db2e2 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -48,14 +48,16 @@ PostgreSQL documentation - The result is equivalent to replacing the target data directory with the - source one. Only changed blocks from relation files are copied; - all other files are copied in full, including configuration files. The - advantage of pg_rewind over taking a new base backup, or - tools like rsync, is that pg_rewind does - not require reading through unchanged blocks in the cluster. This makes - it a lot faster when the database is large and only a small - fraction of blocks differ between the clusters. + After a successful rewind, the state of the target data directory is + analogous to a base backup of the source data directory. Unlike taking + a new base backup or using a tool like rsync, + pg_rewind does not require comparing or copying + unchanged relation blocks in the cluster. Only changed blocks from existing + relation files are copied; all other files, including new relation files, + configuration files, and WAL segments, are copied in full. As such the + rewind operation is significantly faster than other approaches when the + database is large and only a small fraction of blocks differ between the + clusters. @@ -77,16 +79,18 @@ PostgreSQL documentation - When the target server is started for the first time after running - pg_rewind, it will go into recovery mode and replay all - WAL generated in the source server after the point of divergence. - If some of the WAL was no longer available in the source server when - pg_rewind was run, and therefore could not be copied by the - pg_rewind session, it must be made available when the - target server is started. This can be done by creating a - recovery.signal file in the target data directory - and configuring suitable - in postgresql.conf. + After running pg_rewind, WAL replay needs to + complete for the data directory to be in a consistent state. When the + target server is started again it will enter archive recovery and replay + all WAL generated in the source server from the last checkpoint before + the point of divergence. If some of the WAL was no longer available in the + source server when pg_rewind was run, and + therefore could not be copied by the pg_rewind + session, it must be made available when the target server is started. + This can be done by creating a recovery.signal file + in
Re: pg_rewind docs correction
On Tue, Apr 28, 2020 at 12:31 AM Michael Paquier wrote: > > On Mon, Mar 09, 2020 at 09:26:17AM -0400, James Coleman wrote: > >> - pg_stat_tmp/, and > >> - pg_subtrans/ are omitted from the data copied > >> - from the source cluster. Any file or directory beginning with > >> - pgsql_tmp is omitted, as well as are > >> + pg_stat_tmp/, and > >> pg_subtrans/ > >> + are omitted from the data copied from the source cluster. The files > >> > >> This is just reorganizing an existing list, why? > >> > > > > The grammar seemed a bit awkward to me, so while I was already reworking > > this paragraph I tried to clean that up a bit. > > Thanks for the new patch, and sorry for the delay. > > Okay, I saw what you were coming at here, with one sentence for > directories, and one for files. > > > Still ongoing, correct? I guess I mentally think of them as being only one > > month, but I guess that's not actually true. Regardless I'm not sure what > > policy is for patches that have been in flight in hackers for a while but > > just missed being added to the CF app. > > This is a documentation patch, so improving this part of the docs now > is fine by me, particularly as this is an improvement. Here are more > notes from me: > - I have removed the "As with a base backup" at the beginning of the > second paragraph you modified. The first paragraph modified already > references a base backup, so one reference is enough IMO. > - WAL replay does not happen from the WAL position where WAL diverged, > but from the last checkpoint before WAL diverged. > - Did some tweaks about the new part for configuration files, as it > may actually not be necessary to update the configuration for recovery > to complete (depending on the settings of the source, the target may > just require the creation of a standby.signal file in its data > directory particularly with a common archive location for multiple > clusters). > - Some word-smithing in the step-by-step description. > > Is the updated version fine for you? In your revised patched the follow paragraph: + +As pg_rewind copies configuration files +entirely from the source, it may be required to correct the configuration +used for recovery before restarting the target server, especially the +the target is reintroduced as a standby of the source. If you restart +the server after the rewind operation has finished but without configuring +recovery, the target may again diverge from the primary. + I think is missing a word. Instead of "especially the the target" should be "especially if the target". In this block: + Create a backup_label file to begin WAL replay at + the checkpoint created at failover and configure the + pg_control file with a minimum consistency LSN + defined as the result of pg_current_wal_insert_lsn() + when rewinding from a live source and using the last checkpoint LSN + when rewinding from a stopped source. + Perhaps change "and using the last checkpoint LSN" to "or the last checkpoint LSN". Alternatively you could make the grammar parallel by changing to "and defined as the last checkpoint LSN", but that seems wordy, and the "defined as [item or item]" is already a good grammar construction. Other than those two small things, your proposed revision looks good to me. Thanks, James
Re: pg_rewind docs correction
On Mon, Mar 09, 2020 at 09:26:17AM -0400, James Coleman wrote: >> - pg_stat_tmp/, and >> - pg_subtrans/ are omitted from the data copied >> - from the source cluster. Any file or directory beginning with >> - pgsql_tmp is omitted, as well as are >> + pg_stat_tmp/, and >> pg_subtrans/ >> + are omitted from the data copied from the source cluster. The files >> >> This is just reorganizing an existing list, why? >> > > The grammar seemed a bit awkward to me, so while I was already reworking > this paragraph I tried to clean that up a bit. Thanks for the new patch, and sorry for the delay. Okay, I saw what you were coming at here, with one sentence for directories, and one for files. > Still ongoing, correct? I guess I mentally think of them as being only one > month, but I guess that's not actually true. Regardless I'm not sure what > policy is for patches that have been in flight in hackers for a while but > just missed being added to the CF app. This is a documentation patch, so improving this part of the docs now is fine by me, particularly as this is an improvement. Here are more notes from me: - I have removed the "As with a base backup" at the beginning of the second paragraph you modified. The first paragraph modified already references a base backup, so one reference is enough IMO. - WAL replay does not happen from the WAL position where WAL diverged, but from the last checkpoint before WAL diverged. - Did some tweaks about the new part for configuration files, as it may actually not be necessary to update the configuration for recovery to complete (depending on the settings of the source, the target may just require the creation of a standby.signal file in its data directory particularly with a common archive location for multiple clusters). - Some word-smithing in the step-by-step description. Is the updated version fine for you? -- Michael From 30d0e80e8e777c9b1c3f34aa281f9623e61ea17c Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 28 Apr 2020 13:29:26 +0900 Subject: [PATCH v5] Improve pg_rewind explanation and warnings The pg_rewind docs currently assert that the state of the target's data directory after rewind is equivalent to the source's data directory. But that isn't quite true both because the base state is further back in time and because the target's data directory will include the current state on the source of any copied blocks. Instead using the analogy to back backups helps explain the state, as well as the pros and cons of using the utility. The How It Works section now: - Includes details about how the backup_label file is created. - Includes details about how the pg_control file is updated. - Is updated to include WAL segments and new relation files in the list of files copied wholesale from the source. Finally, document clearly the state of the cluster after the operation and also the operation sequencing dangers caused by copying configuration files from the source. --- doc/src/sgml/ref/pg_rewind.sgml | 90 + 1 file changed, 57 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 07c49e4719..9525e09c98 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -48,14 +48,16 @@ PostgreSQL documentation - The result is equivalent to replacing the target data directory with the - source one. Only changed blocks from relation files are copied; - all other files are copied in full, including configuration files. The - advantage of pg_rewind over taking a new base backup, or - tools like rsync, is that pg_rewind does - not require reading through unchanged blocks in the cluster. This makes - it a lot faster when the database is large and only a small - fraction of blocks differ between the clusters. + After a successful rewind, the state of the target data directory is + analogous to a base backup of the source data directory. Unlike taking + a new base backup or using a tool like rsync, + pg_rewind does not require comparing or copying + unchanged relation blocks in the cluster. Only changed blocks from existing + relation files are copied; all other files, including new relation files, + configuration files, and WAL segments, are copied in full. As such the + rewind operation is significantly faster than other approaches when the + database is large and only a small fraction of blocks differ between the + clusters. @@ -77,16 +79,18 @@ PostgreSQL documentation - When the target server is started for the first time after running - pg_rewind, it will go into recovery mode and replay all - WAL generated in the source server after the point of divergence. - If some of the WAL was no longer available in the source server when - pg_rewind was run, and therefore could not be copied by the - pg_rewind session, it must be made available when t
Re: pg_rewind docs correction
On Mon, Mar 9, 2020 at 2:59 AM Michael Paquier wrote: > On Sun, Mar 08, 2020 at 05:13:21PM -0400, James Coleman wrote: > > I've added the information about how the backup label control file is > > written, and updated the How It Works steps to refer to that separately > > from restart. > > > > Additionally the How It Works is updated to include WAL segments and new > > relation files in the list of files copied wholesale, since that was > > previously stated but somewhat contradicted there. > > - The result is equivalent to replacing the target data directory with > the > - source one. Only changed blocks from relation files are copied; > - all other files are copied in full, including configuration files. The > - advantage of pg_rewind over taking a new > base backup, or > - tools like rsync, is that > pg_rewind does > - not require reading through unchanged blocks in the cluster. This makes > - it a lot faster when the database is large and only a small > - fraction of blocks differ between the clusters. > + After a successful rewind and subsequent WAL replay, the target data > + directory is equivalent to a base backup of the source data directory. > While > + only changed blocks from existing relation files are copied; all other > files > + are copied in full, including new relation files, configuration files, > and WAL > + segments. The advantage of pg_rewind over > taking a > > The first sentence you are adding refers to "subsequent WAL replay". > However, this paragraph emphasizes with the state of the target > cluster after running pg_rewind but *before* make the target cluster > start recovery. So shouldn't you just remove the part "and subsequent > WAL replay" from your first new sentence? > I'd originally typed this: I'm not sure I follow. After pg_rewind but before replay the directory is *not* equivalent to a base backup. I don't see how paragraph is clearly limited to describing what pg_rewind does. While the 2nd sentence is about pg_rewind steps specifically, the paragraph (even in the original) goes on to compare it to a base backup so we're talking about the operation in totality not just the one tool. But I realized while typing it that I was probably missing something of what you were getting at: is the hangup on calling out the WAL replay that a base backup (or rsync even) *also* requires WAL reply to reach a consistent state? I hadn't thought of that while writing this initially, so I've updated the patch to eliminate that part but also to make the analogy to base backups more direct, since it's helpful in understanding what result the tool is trying to accomplish and how it differs. In the same paragraph, I think that you should remove the "While" from > "While only changed blocks", as the second part of the sentence refers > to the other files, WAl segments, etc. > Fixed as part of the above. > The second paragraph of the docs regarding timeline lookup is > unchanged, which is fine. > > - When the target server is started for the first time after running > - pg_rewind, it will go into recovery mode > and replay all > - WAL generated in the source server after the point of divergence. > + After running pg_rewind the data directory > is > + not immediately in a consistent state. However > + pg_rewind configures the control file so > that when > + the target server is started again it will enter recovery mode and > replay all > + WAL generated in the source server after the point of divergence. > > The second part of the third paragraph is not changed, and the > modification you are doing here is about the control file. I am > still unconvinced that this is a good change, because mentioning the > control file would be actually more adapted to the part "How it > works", where you are adding details about the backup_label file, and > already include details about the minimum consistency LSN itself > stored in the control file. > I've removed the control file reference and instead continued the analogy to base backups. > + > +Because pg_rewind copies configuration > files > +entirely from the source, correcting recovery configuration options > before > +restarting the server is necessary if you intend to re-introduce the > target > +as a replica of the source. If you restart the server after the rewind > +operation has finished but without configuring recovery, the target > will > +again diverge from the primary. > + > > True that this is not outlined enough. > Thanks. > + The relation files are now to their state at the last checkpoint > completed > + prior to the point at which the WAL timelines of the source and > target > + diverged plus the current state on the source of any blocks changed > on the > + target after that divergence. > > "Relation files are now in a state equivalent to the moment of the > last completed checkpoint prior to the point.."? > Updated. > - pg_s
Re: pg_rewind docs correction
On Sun, Mar 08, 2020 at 05:13:21PM -0400, James Coleman wrote: > I've added the information about how the backup label control file is > written, and updated the How It Works steps to refer to that separately > from restart. > > Additionally the How It Works is updated to include WAL segments and new > relation files in the list of files copied wholesale, since that was > previously stated but somewhat contradicted there. - The result is equivalent to replacing the target data directory with the - source one. Only changed blocks from relation files are copied; - all other files are copied in full, including configuration files. The - advantage of pg_rewind over taking a new base backup, or - tools like rsync, is that pg_rewind does - not require reading through unchanged blocks in the cluster. This makes - it a lot faster when the database is large and only a small - fraction of blocks differ between the clusters. + After a successful rewind and subsequent WAL replay, the target data + directory is equivalent to a base backup of the source data directory. While + only changed blocks from existing relation files are copied; all other files + are copied in full, including new relation files, configuration files, and WAL + segments. The advantage of pg_rewind over taking a The first sentence you are adding refers to "subsequent WAL replay". However, this paragraph emphasizes with the state of the target cluster after running pg_rewind but *before* make the target cluster start recovery. So shouldn't you just remove the part "and subsequent WAL replay" from your first new sentence? In the same paragraph, I think that you should remove the "While" from "While only changed blocks", as the second part of the sentence refers to the other files, WAl segments, etc. The second paragraph of the docs regarding timeline lookup is unchanged, which is fine. - When the target server is started for the first time after running - pg_rewind, it will go into recovery mode and replay all - WAL generated in the source server after the point of divergence. + After running pg_rewind the data directory is + not immediately in a consistent state. However + pg_rewind configures the control file so that when + the target server is started again it will enter recovery mode and replay all + WAL generated in the source server after the point of divergence. The second part of the third paragraph is not changed, and the modification you are doing here is about the control file. I am still unconvinced that this is a good change, because mentioning the control file would be actually more adapted to the part "How it works", where you are adding details about the backup_label file, and already include details about the minimum consistency LSN itself stored in the control file. + +Because pg_rewind copies configuration files +entirely from the source, correcting recovery configuration options before +restarting the server is necessary if you intend to re-introduce the target +as a replica of the source. If you restart the server after the rewind +operation has finished but without configuring recovery, the target will +again diverge from the primary. + True that this is not outlined enough. + The relation files are now to their state at the last checkpoint completed + prior to the point at which the WAL timelines of the source and target + diverged plus the current state on the source of any blocks changed on the + target after that divergence. "Relation files are now in a state equivalent to the moment of the last completed checkpoint prior to the point.."? - pg_stat_tmp/, and - pg_subtrans/ are omitted from the data copied - from the source cluster. Any file or directory beginning with - pgsql_tmp is omitted, as well as are + pg_stat_tmp/, and pg_subtrans/ + are omitted from the data copied from the source cluster. The files This is just reorganizing an existing list, why? + Create a backup label file to begin WAL replay at the checkpoint created + at failover and a minimum consistency LSN using + pg_current_wal_insert_lsn(), when using a live source + and the last checkpoint LSN, when using a stopped source. Now would be the moment to mention the control file. > I realized I didn't previously add this to the CF; since it's not a new > patch I've added it to the current CF, but if this is incorrect please let > me know. The last CF of Postgres 13 began at the beginning of February :( -- Michael signature.asc Description: PGP signature
Re: pg_rewind docs correction
On Sun, Mar 8, 2020 at 5:13 PM James Coleman wrote: > > I realized I didn't previously add this to the CF; since it's not a new > patch I've added it to the current CF, but if this is incorrect please let > me know. > Hmm, looks like I can't add it to the current one. I added it to the next one. I think it could probably go now, since the patch is really 6 months old, but either way is fine -- it's just a docs patch. James
Re: pg_rewind docs correction
On Tue, Sep 17, 2019 at 9:41 PM Michael Paquier wrote: > On Tue, Sep 17, 2019 at 08:38:18AM -0400, James Coleman wrote: > > I don't agree that that's a valid equivalency. I myself spent a lot of > > time trying to understand how this could possibly be true a while > > back, and even looked at source code to be certain. I've asked other > > people and found the same confusion. > > > > As I read it the 2nd second sentence doesn't actually tell you the > > differences; it makes a quick attempt at summarizing *how* the first > > sentence is true, but if the first sentence isn't accurate, then it's > > hard to read the 2nd one as helping. > > Well, then it comes back to the part where I am used to the existing > docs :) > > > If you'd prefer something less detailed at this point at that point in > > the docs, then something along the lines of "results in a data > > directory state which can then be safely replayed from the source" or > > some such. > > Actually this is a good suggestion, and could replace the first > sentence of this paragraph. > > > The docs shouldn't be correct just for someone how already understands > > the intricacies. And the end user shouldn't have to read the "how it > > works" (which incidentally is kinda hidden at the bottom underneath > > the CLI args -- perhaps we could move that?) to extrapolate things in > > the primary documentation. > > Perhaps. This doc page is not that long either. > I'd set this aside for quite a while, but I was looking at it again this afternoon, and I've come to see your concern about the opening paragraphs remaining relatively simple. To that end I believe I've come up with a patch that's a good compromise: retaining that simplicity and being more clear and accurate at the same time. In the first paragraph I've updated it to refer to both "successful rewind and subsequent WAL replay" and the result I describe as being equivalent to the result of a base backup, since that's more technically correct anyway (the current text could be read as implying a full out copy of the data directory, but that's not really true just as it isn't with pg_basebackup). I've added the information about how the backup label control file is written, and updated the How It Works steps to refer to that separately from restart. Additionally the How It Works is updated to include WAL segments and new relation files in the list of files copied wholesale, since that was previously stated but somewhat contradicted there. I realized I didn't previously add this to the CF; since it's not a new patch I've added it to the current CF, but if this is incorrect please let me know. Thanks, James From 592ba15c35bb16e55b0bb0a7e7bdbb6dd4e08a0b Mon Sep 17 00:00:00 2001 From: James Coleman Date: Sun, 8 Mar 2020 16:39:45 -0400 Subject: [PATCH v3] Improve pg_rewind explanation and warnings The pg_rewind docs currently assert that the state of the target's data directory after rewind is equivalent to the source's data directory. But that isn't quite true both because the base state is further back in time and because the target's data directory will include the current state on the source of any copied blocks. Additionally the state isn't equal to a copy of the source data directory; it's equivalent to a base backup of the source. The How It Works section now: - Includes details about how the backup label file is created. - Is updated to include WAL segments and new relation files in the list of files copied wholesale from the source. Finally, document clearly the state of the cluster after the operation and also the operation sequencing dangers caused by copying configuration files from the source. --- doc/src/sgml/ref/pg_rewind.sgml | 87 - 1 file changed, 54 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..bc6f0009cc 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -48,14 +48,16 @@ PostgreSQL documentation - The result is equivalent to replacing the target data directory with the - source one. Only changed blocks from relation files are copied; - all other files are copied in full, including configuration files. The - advantage of pg_rewind over taking a new base backup, or - tools like rsync, is that pg_rewind does - not require reading through unchanged blocks in the cluster. This makes - it a lot faster when the database is large and only a small - fraction of blocks differ between the clusters. + After a successful rewind and subsequent WAL replay, the target data + directory is equivalent to a base backup of the source data directory. While + only changed blocks from existing relation files are copied; all other files + are copied in full, including new relation files, configuration files, and WAL + segments. The advantage of pg_rewind over taking a + new base backup, or tools like rsync, is t
Re: pg_rewind docs correction
On Tue, Sep 17, 2019 at 08:38:18AM -0400, James Coleman wrote: > I don't agree that that's a valid equivalency. I myself spent a lot of > time trying to understand how this could possibly be true a while > back, and even looked at source code to be certain. I've asked other > people and found the same confusion. > > As I read it the 2nd second sentence doesn't actually tell you the > differences; it makes a quick attempt at summarizing *how* the first > sentence is true, but if the first sentence isn't accurate, then it's > hard to read the 2nd one as helping. Well, then it comes back to the part where I am used to the existing docs :) > If you'd prefer something less detailed at this point at that point in > the docs, then something along the lines of "results in a data > directory state which can then be safely replayed from the source" or > some such. Actually this is a good suggestion, and could replace the first sentence of this paragraph. > The docs shouldn't be correct just for someone how already understands > the intricacies. And the end user shouldn't have to read the "how it > works" (which incidentally is kinda hidden at the bottom underneath > the CLI args -- perhaps we could move that?) to extrapolate things in > the primary documentation. Perhaps. This doc page is not that long either. -- Michael signature.asc Description: PGP signature
Re: pg_rewind docs correction
On Tue, Sep 17, 2019 at 3:51 AM Michael Paquier wrote: > > On Sun, Sep 15, 2019 at 10:36:04AM -0400, James Coleman wrote: > > On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier > > wrote: > >> +The rewind operation is not expected to result in a consistent data > >> +directory state either internally to the node or with respect to the > >> rest > >> +of the cluster. Instead the resulting data directory will only be > >> consistent > >> +after WAL replay has completed to at least the LSN at which changed > >> blocks > >> +copied from the source were originally written on the source. > >> > >> That's not necessarily true. pg_rewind enforces in the control file > >> of the target the minimum consistency LSN to be > >> pg_current_wal_insert_lsn() when using a live source or the last > >> checkpoint LSN for a stopped source, so while that sounds true from > >> the point of view of all the blocks copied, the control file may still > >> cause a complain that the target recovering has not reached its > >> consistent point even if all the blocks are already at a position > >> not-so-far from what has been registered in the control file. > > > > I could just say "after WAL replay has completed to a consistent state"? > > I still would not change this paragraph. The first sentence means > that we have an equivalency, because that's the case if you think > about it as we make sure that the target is able to sync with the > source, and the target gets into a state where it as an on-disk state > equivalent to the target up to the minimum consistency point defined > in the control file once the tool has done its work (this last point > is too precise to be included in a global description to be honest). > And the second sentence makes clear what are the actual diffs are. > >> + the point at which the WAL timelines of the source and target diverged > >> plus > >> + the current state on the source of any blocks changed on the target > >> after > >> + that divergence. While only changed blocks from existing relation > >> files are > >> > >> And here we could mention that all the blocks copied from the source > >> are the ones which are found in the WAL records of the target until > >> the end of WAL of its timeline. Still, that's basically what is > >> mentioned in the first part of "How It Works", which explains things > >> better. I honestly don't really see that all this paragraph is an > >> improvement over the simplicity of the original when it comes to > >> understand the global idea of what pg_rewind does. > > > > The problem with the original is that while simple, it's actually > > incorrect in that simplicity. Pg_rewind does *not* result in the data > > directory on the target matching the data directory on the source. > > That's not what I get from the original docs, but I may be too much > used to it. I don't agree that that's a valid equivalency. I myself spent a lot of time trying to understand how this could possibly be true a while back, and even looked at source code to be certain. I've asked other people and found the same confusion. As I read it the 2nd second sentence doesn't actually tell you the differences; it makes a quick attempt at summarizing *how* the first sentence is true, but if the first sentence isn't accurate, then it's hard to read the 2nd one as helping. If you'd prefer something less detailed at this point at that point in the docs, then something along the lines of "results in a data directory state which can then be safely replayed from the source" or some such. The docs shouldn't be correct just for someone how already understands the intricacies. And the end user shouldn't have to read the "how it works" (which incidentally is kinda hidden at the bottom underneath the CLI args -- perhaps we could move that?) to extrapolate things in the primary documentation. James Coleman
Re: pg_rewind docs correction
On Sun, Sep 15, 2019 at 10:36:04AM -0400, James Coleman wrote: > On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier wrote: >> +The rewind operation is not expected to result in a consistent data >> +directory state either internally to the node or with respect to the >> rest >> +of the cluster. Instead the resulting data directory will only be >> consistent >> +after WAL replay has completed to at least the LSN at which changed >> blocks >> +copied from the source were originally written on the source. >> >> That's not necessarily true. pg_rewind enforces in the control file >> of the target the minimum consistency LSN to be >> pg_current_wal_insert_lsn() when using a live source or the last >> checkpoint LSN for a stopped source, so while that sounds true from >> the point of view of all the blocks copied, the control file may still >> cause a complain that the target recovering has not reached its >> consistent point even if all the blocks are already at a position >> not-so-far from what has been registered in the control file. > > I could just say "after WAL replay has completed to a consistent state"? I still would not change this paragraph. The first sentence means that we have an equivalency, because that's the case if you think about it as we make sure that the target is able to sync with the source, and the target gets into a state where it as an on-disk state equivalent to the target up to the minimum consistency point defined in the control file once the tool has done its work (this last point is too precise to be included in a global description to be honest). And the second sentence makes clear what are the actual diffs are. >> + the point at which the WAL timelines of the source and target diverged >> plus >> + the current state on the source of any blocks changed on the target after >> + that divergence. While only changed blocks from existing relation files >> are >> >> And here we could mention that all the blocks copied from the source >> are the ones which are found in the WAL records of the target until >> the end of WAL of its timeline. Still, that's basically what is >> mentioned in the first part of "How It Works", which explains things >> better. I honestly don't really see that all this paragraph is an >> improvement over the simplicity of the original when it comes to >> understand the global idea of what pg_rewind does. > > The problem with the original is that while simple, it's actually > incorrect in that simplicity. Pg_rewind does *not* result in the data > directory on the target matching the data directory on the source. That's not what I get from the original docs, but I may be too much used to it. >> + >> +Because pg_rewind copies configuration files >> +entirely from the source, correcting recovery configuration options >> before >> +restarting the server is necessary if you intend to re-introduce the >> target >> +as a replica of the source. If you restart the server after the rewind >> +operation has finished but without configuring recovery, the target will >> +again diverge from the primary. >> + >> >> No objections regarding that part. Now it seems to me that we had >> better apply that to the last part of "How it works" instead? I kind >> of agree that the last paragraph could provide more details regarding >> the risks of overwriting the wanted configuration. The existing docs >> also mention that pg_rewind only creates a backup_label file to start >> recovery, perhaps we could mention up to which point recovery happens >> in this section? There is a bit more here than just "apply the WAL". > > I'll look to see if there's a better place to put this. Thanks. From what I can see, we could improve further the doc part about how the tool works in details, especially regarding the configuration files which may get overwritten and be more precise about that. -- Michael signature.asc Description: PGP signature
Re: pg_rewind docs correction
On Sun, Sep 15, 2019 at 10:25 AM Michael Paquier wrote: > > On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote: > > Updated (plus some additional wordsmithing). > > +The rewind operation is not expected to result in a consistent data > +directory state either internally to the node or with respect to the rest > +of the cluster. Instead the resulting data directory will only be > consistent > +after WAL replay has completed to at least the LSN at which changed > blocks > +copied from the source were originally written on the source. > > That's not necessarily true. pg_rewind enforces in the control file > of the target the minimum consistency LSN to be > pg_current_wal_insert_lsn() when using a live source or the last > checkpoint LSN for a stopped source, so while that sounds true from > the point of view of all the blocks copied, the control file may still > cause a complain that the target recovering has not reached its > consistent point even if all the blocks are already at a position > not-so-far from what has been registered in the control file. I could just say "after WAL replay has completed to a consistent state"? > + the point at which the WAL timelines of the source and target diverged > plus > + the current state on the source of any blocks changed on the target after > + that divergence. While only changed blocks from existing relation files > are > > And here we could mention that all the blocks copied from the source > are the ones which are found in the WAL records of the target until > the end of WAL of its timeline. Still, that's basically what is > mentioned in the first part of "How It Works", which explains things > better. I honestly don't really see that all this paragraph is an > improvement over the simplicity of the original when it comes to > understand the global idea of what pg_rewind does. The problem with the original is that while simple, it's actually incorrect in that simplicity. Pg_rewind does *not* result in the data directory on the target matching the data directory on the source. > + > +Because pg_rewind copies configuration files > +entirely from the source, correcting recovery configuration options > before > +restarting the server is necessary if you intend to re-introduce the > target > +as a replica of the source. If you restart the server after the rewind > +operation has finished but without configuring recovery, the target will > +again diverge from the primary. > + > > No objections regarding that part. Now it seems to me that we had > better apply that to the last part of "How it works" instead? I kind > of agree that the last paragraph could provide more details regarding > the risks of overwriting the wanted configuration. The existing docs > also mention that pg_rewind only creates a backup_label file to start > recovery, perhaps we could mention up to which point recovery happens > in this section? There is a bit more here than just "apply the WAL". I'll look to see if there's a better place to put this. James Coleman
Re: pg_rewind docs correction
On Sat, Sep 14, 2019 at 07:00:54PM -0400, James Coleman wrote: > Updated (plus some additional wordsmithing). +The rewind operation is not expected to result in a consistent data +directory state either internally to the node or with respect to the rest +of the cluster. Instead the resulting data directory will only be consistent +after WAL replay has completed to at least the LSN at which changed blocks +copied from the source were originally written on the source. That's not necessarily true. pg_rewind enforces in the control file of the target the minimum consistency LSN to be pg_current_wal_insert_lsn() when using a live source or the last checkpoint LSN for a stopped source, so while that sounds true from the point of view of all the blocks copied, the control file may still cause a complain that the target recovering has not reached its consistent point even if all the blocks are already at a position not-so-far from what has been registered in the control file. + the point at which the WAL timelines of the source and target diverged plus + the current state on the source of any blocks changed on the target after + that divergence. While only changed blocks from existing relation files are And here we could mention that all the blocks copied from the source are the ones which are found in the WAL records of the target until the end of WAL of its timeline. Still, that's basically what is mentioned in the first part of "How It Works", which explains things better. I honestly don't really see that all this paragraph is an improvement over the simplicity of the original when it comes to understand the global idea of what pg_rewind does. + +Because pg_rewind copies configuration files +entirely from the source, correcting recovery configuration options before +restarting the server is necessary if you intend to re-introduce the target +as a replica of the source. If you restart the server after the rewind +operation has finished but without configuring recovery, the target will +again diverge from the primary. + No objections regarding that part. Now it seems to me that we had better apply that to the last part of "How it works" instead? I kind of agree that the last paragraph could provide more details regarding the risks of overwriting the wanted configuration. The existing docs also mention that pg_rewind only creates a backup_label file to start recovery, perhaps we could mention up to which point recovery happens in this section? There is a bit more here than just "apply the WAL". -- Michael signature.asc Description: PGP signature
Re: pg_rewind docs correction
On Sat, Sep 14, 2019 at 12:20 AM Michael Paquier wrote: > > On Fri, Sep 13, 2019 at 01:47:03PM -0400, James Coleman wrote: > > So I've attached a patch to summarize more correctly as well as > > document clearly the state of the cluster after the operation and also > > the operation sequencing dangers caused by copying configuration > > files from the source. > > + After a successful rewind, the target data directory is equivalent > to the > + to the state of the data directory at the point at which the > source and > + target diverged plus the current state on the source of any blocks > changed > + on the target after that divergence. While only changed blocks > from relation > + files are copied; all other files are copied in full, including > configuration > + files and WAL segments. The advantage of > pg_rewind > + over taking a new base backup, or tools like > rsync, > + is that pg_rewind does not require > comparing or > + copying unchanged relation blocks in the cluster. As such the > rewind operation > + is significantly faster than other approaches when the database is > large and > + only a small fraction of blocks differ between the clusters. > > The point of divergence could be defined as the LSN position where WAL > has forked on the new timeline, but the block diffs are copied from > actually the last checkpoint just before WAL has forked. So this new > paragraph brings confusion about the actual divergence point. > > Regarding the relation files, if the file does not exist on the target > but does exist on the source, it is also copied fully, so the second > sentence is wrong here to mention as relation files could also be > copied fully. Updated (plus some additional wordsmithing). James Coleman 001_pg_rewind_explanation_doc_fixes_v2.patch Description: Binary data
Re: pg_rewind docs correction
On Fri, Sep 13, 2019 at 01:47:03PM -0400, James Coleman wrote: > So I've attached a patch to summarize more correctly as well as > document clearly the state of the cluster after the operation and also > the operation sequencing dangers caused by copying configuration > files from the source. + After a successful rewind, the target data directory is equivalent to the + to the state of the data directory at the point at which the source and + target diverged plus the current state on the source of any blocks changed + on the target after that divergence. While only changed blocks from relation + files are copied; all other files are copied in full, including configuration + files and WAL segments. The advantage of pg_rewind + over taking a new base backup, or tools like rsync, + is that pg_rewind does not require comparing or + copying unchanged relation blocks in the cluster. As such the rewind operation + is significantly faster than other approaches when the database is large and + only a small fraction of blocks differ between the clusters. The point of divergence could be defined as the LSN position where WAL has forked on the new timeline, but the block diffs are copied from actually the last checkpoint just before WAL has forked. So this new paragraph brings confusion about the actual divergence point. Regarding the relation files, if the file does not exist on the target but does exist on the source, it is also copied fully, so the second sentence is wrong here to mention as relation files could also be copied fully. -- Michael signature.asc Description: PGP signature
pg_rewind docs correction
The pg_rewind docs assert that the state of the target's data directory after rewind is equivalent to the source's data directory. But that isn't true both because the base state is further back in time and because the target's data directory will include the current state on the source of any copied blocks. So I've attached a patch to summarize more correctly as well as document clearly the state of the cluster after the operation and also the operation sequencing dangers caused by copying configuration files from the source. James Coleman 001_pg_rewind_explanation_doc_fixes.patch Description: Binary data