On Fri, 2021-10-22 at 15:34 +0530, Bharath Rupireddy wrote:
> If the suggestion is to have the wait and retry logic embedded into
> the user-written restore_command, IMHO, it's not a good idea as the
> restore_command is external to the core PG and the FATAL error
> "recovery ended before configured recovery target was reached" is an
> internal thing.

It seems likely that you'd want to tweak the exact behavior for the
given system. For instance, if the files are making some progress, and
you can estimate that in 2 more minutes everything will be fine, then
you may be more willing to wait those two minutes. But if no progress
has happened since recovery began 15 minutes ago, you may want to fail
immediately.

All of this nuance would be better captured in a specialized script
than in a generic timeout in the server code.

What do you want to do after the timeout happens? If you want to issue
a WARNING instead of failing outright, perhaps that makes sense for
exploratory PITR cases. That could be a simple boolean GUC without
needing to introduce the timeout logic into the server.

I think it's an interesting point that it can be hard to choose a
reasonable recovery target if the system is completely down. We could
use some better tooling or metadata around the LSNs, XIDs or timestamp
ranges available in a pg_wal directory or an archive. Even better would
be to see the available named restore points. This would make it easier
to calculate how long recovery might take for a given restore point, or
whether it's not going to work at all because there's not enough WAL.

Regards,
	Jeff Davis
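
To make the "specialized script" idea concrete, here is a rough sketch
of a restore_command wrapper that keeps waiting only while the archive
directory shows progress, and gives up once nothing has changed for a
while. Everything in it is an assumption for illustration: the script
name, the function names, the 15-minute stall limit, and the notion
that "progress" means any file in the archive directory appearing or
changing. It also assumes files land in the archive atomically (for
example via rename), so a file that exists is complete.

```python
import os
import shutil
import sys
import time


def archive_state(archive_dir):
    """Snapshot of (name, size, mtime) for every file, used to detect progress."""
    state = set()
    for entry in os.scandir(archive_dir):
        if entry.is_file():
            st = entry.stat()
            state.add((entry.name, st.st_size, st.st_mtime_ns))
    return state


def restore_with_patience(src, dst, stall_timeout=900.0, poll_interval=5.0,
                          now=time.monotonic, sleep=time.sleep):
    """Copy src to dst, waiting for src as long as the archive keeps changing.

    Returns 0 on success and 1 once the archive has shown no change for
    stall_timeout seconds (hypothetical policy, tune per system).
    """
    archive_dir = os.path.dirname(src) or "."
    last_state = archive_state(archive_dir)
    last_progress = now()
    while True:
        if os.path.isfile(src):
            shutil.copyfile(src, dst)
            return 0
        if now() - last_progress >= stall_timeout:
            return 1  # nonzero exit tells the server the file is unavailable
        sleep(poll_interval)
        state = archive_state(archive_dir)
        if state != last_state:
            # Something arrived or changed: reset the stall clock.
            last_state, last_progress = state, now()


# Only act when invoked with both arguments, e.g. (hypothetical path):
#   restore_command = 'python3 restore_wait.py /mnt/archive/%f %p'
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(restore_with_patience(sys.argv[1], sys.argv[2]))
```

One caveat with a policy like this: the server also asks
restore_command for files that may legitimately never exist (timeline
history files, for instance), so a real script would probably want to
fail fast for those rather than sit through the full stall timeout.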