On Thu, Oct 12, 2023 at 9:57 AM Ants Aasma <a...@cybertec.at> wrote:
> This reminds me that xlp_tli is not being used to its full potential right
> now either. We only check that it's not going backwards, but there is at
> least one not very hard to hit way to get postgres to silently replay on the
> wrong timeline. [1]
>
> [1] https://www.postgresql.org/message-id/canwkhkmn3qwacvudzhb6wsvlrtkwebiyso-klfykkqvwuql...@mail.gmail.com

Maybe I'm missing something, but that seems mostly unrelated. What you're discussing there is the server's ability to figure out when it ought to perform a timeline switch. In other words, the server settles on the wrong TLI and therefore opens and reads from the wrong filename. But here, we're talking about the case where the server is correct about the TLI and LSN and hence opens exactly the right file on disk, but the contents of the file on disk aren't what they're supposed to be due to a procedural error.

Said differently, I don't see how anything we could do with xlp_tli would actually fix the problem discussed in that thread. That can detect a situation where the TLI of the file doesn't match the TLI of the pages inside the file, but it doesn't help with the case where the server decided to read the wrong file in the first place.

But this does make me wonder whether storing xlp_tli and xlp_pageaddr in every page is really worth the bit-space. That takes 12 bytes plus any padding it forces us to incur, but the actual entropy content of those 12 bytes must be quite low. In normal cases probably 7 or so of those bytes are going to consist entirely of zero bits (TLI < 256, LSN%8k == 0, LSN < 2^40). We could probably find a way of jumbling the LSN, TLI, and maybe some other stuff into an 8-byte quantity or even perhaps a 4-byte quantity that would do about as good a job catching problems as what we have now (e.g. LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI)). In the event of a mismatch, the value actually stored in the page header would be harder for humans to understand, but I'm not sure that really matters here. Users should mostly be concerned with whether a WAL file matches the cluster where they're trying to replay it; forensics on misplaced or corrupted WAL files should be comparatively rare.

--
Robert Haas
EDB: http://www.enterprisedb.com
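
For concreteness, here is a rough sketch of the 4-byte jumble mentioned above, LSN_HIGH32^LSN_LOW32^BITREVERSE(TLI). This is not PostgreSQL code: the names bitreverse32 and page_check_value are hypothetical, and plain uint64_t/uint32_t stand in for XLogRecPtr and TimeLineID. The bit reversal presumably serves to move a small TLI into the high-order bits, away from the low-order bits where the upper half of a typical LSN sits.

```c
#include <stdint.h>

/* Reverse the bit order of a 32-bit value (hypothetical helper). */
static uint32_t
bitreverse32(uint32_t v)
{
    v = ((v >> 1) & 0x55555555u) | ((v & 0x55555555u) << 1);
    v = ((v >> 2) & 0x33333333u) | ((v & 0x33333333u) << 2);
    v = ((v >> 4) & 0x0F0F0F0Fu) | ((v & 0x0F0F0F0Fu) << 4);
    v = ((v >> 8) & 0x00FF00FFu) | ((v & 0x00FF00FFu) << 8);
    return (v >> 16) | (v << 16);
}

/*
 * Fold a page's starting LSN and its TLI into one 32-bit value:
 * LSN_HIGH32 ^ LSN_LOW32 ^ BITREVERSE(TLI).
 * Hypothetical name; illustrates the idea only.
 */
static uint32_t
page_check_value(uint64_t lsn, uint32_t tli)
{
    uint32_t lsn_high = (uint32_t) (lsn >> 32);
    uint32_t lsn_low = (uint32_t) lsn;

    return lsn_high ^ lsn_low ^ bitreverse32(tli);
}
```

A reader of the page would recompute page_check_value() from the LSN and TLI it expects and compare the result against the stored value. Being only 32 bits wide, the folded value can collide for some distinct (LSN, TLI) pairs, which is the trade-off against storing both fields in full.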