On Fri, Jan 14, 2022 at 1:08 AM Andres Freund <and...@anarazel.de> wrote:
>
> Hi,
>
> On 2021-12-31 18:12:37 +0530, Bharath Rupireddy wrote:
> > Currently the server is erroring out when unable to remove/parse a
> > logical rewrite file in CheckPointLogicalRewriteHeap wasting the
> > amount of work the checkpoint has done and preventing the checkpoint
> > from finishing.
>
> This seems like it'd make failures to remove the files practically
> invisible. Which'd have it's own set of problems?
>
> What motivated proposing this change?

We had an issue where many mapping files were generated during crash recovery, and the end-of-recovery checkpoint was taking a long time. We had to intervene manually and delete some of the mapping files (although it may not sound sensible) to make the end-of-recovery checkpoint faster. Because of a race between the manual deletion and the checkpoint's own deletion, an unlink error occurred, which crashed the server; the server then entered recovery again, wasting all of the earlier recovery work.

In summary, with the changes proposed for CheckPointLogicalRewriteHeap in this thread (emitting LOG-only messages for unlink failures and continuing with the other files), together with the existing code in CheckPointSnapBuild, I'm sure it will help avoid wasting the recovery work that has already been done if unlink fails for any reason.

Regards,
Bharath Rupireddy.
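
For illustration only, the "log and continue" behaviour described above could look roughly like the standalone sketch below. This is not the patch and not PostgreSQL code: remove_files_best_effort is a hypothetical helper, and fprintf to stderr merely stands in for a LOG-level message. The point is that a failed unlink (for example ENOENT, because someone else already removed the file) is reported and the loop moves on to the remaining files instead of aborting.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration, not PostgreSQL code):
 * try to unlink every path in the list. A failure is reported on stderr,
 * standing in for a LOG-level message, and the loop continues with the
 * next file. Returns the number of files that could not be removed.
 */
static int
remove_files_best_effort(const char *const *paths, int npaths)
{
    int nfailed = 0;

    for (int i = 0; i < npaths; i++)
    {
        if (unlink(paths[i]) != 0)
        {
            /* Report and keep going; do not abort the whole pass. */
            fprintf(stderr, "LOG: could not remove file \"%s\": %s\n",
                    paths[i], strerror(errno));
            nfailed++;
        }
    }

    return nfailed;
}
```

Returning the failure count keeps the failures visible to the caller, which could then decide whether to report a summary, so that they do not go unnoticed.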