On Thu, Apr 7, 2022 at 7:01 AM Robert Haas <robertmh...@gmail.com> wrote:
> Because there's no place to put them in the existing page format. We
> jammed checksums into the 2-byte field that had previously been set
> aside for the TLI, but that wasn't really an ideal solution because it
> meant we ended up with a checksum that is only 16 bits wide. However,
> the 2 bytes set aside for the TLI weren't really being used
> effectively anyway, so repurposing them was relatively easy, and a
> 16-bit checksum is better than nothing.

But if we were in a green-field situation we'd probably not want to use up several bytes for a nonce anyway. You said so yourself.

> I do understand that there are significant challenges and performance
> concerns around having these kinds of initdb-controlled page layout
> changes, so the future of that patch is unclear.

Why does it need to be at initdb time?

Though I cannot prove it, I suspect that the original intent of the special area was to support an additional (though typically small) variable-length array that works a little like the current line pointer array. This array would have to grow backwards (newer items get appended at earlier physical offsets), unlike our line pointer array (which gets appended to at the end, in the simple and obvious way). Growing backwards like this happens with DB systems that store their line pointer array at the end of the page (the traditional approach from the System R days, I believe).

Supporting a variable-length special area array like this would mean that any time you add a new item to the variable-sized array in the special area, the page's entire tuple space has to be memmove()'d backwards by a couple of bytes to create the required space. And so the relevant bufpage.c routine would have to adjust the whole line pointer array such that each lp_off received a compensating adjustment. The array might only be for some kind of page-level transaction metadata, something like that -- shifting it around is pretty expensive (reusing existing slots isn't too expensive, though).

Why can't it work like that? You don't really need to build the full set of bufpage.c facilities (though it might not be a bad idea to fully support these variable-length arrays, which seem like they might come in handy). That seems perfectly compatible with what Matthias wants to do, provided we're willing to deem the special area struct (e.g. BTOpaque) as always coming "first" (which is essentially the same as his current proposal anyway).

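
To make the shifting concrete, here is a minimal toy sketch in C of the idea described above. It is only an illustration under stated assumptions, not bufpage.c code: ToyPage, toy_page_init, toy_page_add_tuple and toy_special_append are hypothetical names, the line pointer array is kept outside the data area for simplicity, and the special area is assumed to hold nothing but the backward-growing array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 256
#define TOY_MAX_ITEMS 16

/* Toy page model; hypothetical, NOT PostgreSQL's PageHeaderData. */
typedef struct ToyPage
{
    uint16_t nitems;                 /* line pointers in use */
    uint16_t lp_off[TOY_MAX_ITEMS];  /* offset of each tuple within data[] */
    uint16_t upper;                  /* tuples occupy data[upper .. special) */
    uint16_t special;                /* special area is data[special .. end) */
    uint8_t  data[TOY_PAGE_SIZE];
} ToyPage;

static void
toy_page_init(ToyPage *p)
{
    memset(p, 0, sizeof(*p));
    p->upper = TOY_PAGE_SIZE;    /* no tuples yet */
    p->special = TOY_PAGE_SIZE;  /* empty special-area array */
}

/* Place a tuple at the top of free space; returns its item index or -1. */
static int
toy_page_add_tuple(ToyPage *p, const void *tup, uint16_t len)
{
    if (p->nitems >= TOY_MAX_ITEMS || len > p->upper)
        return -1;
    p->upper -= len;
    memcpy(p->data + p->upper, tup, len);
    p->lp_off[p->nitems] = p->upper;
    return p->nitems++;
}

/*
 * Append an entry to the special-area array. The array grows backwards,
 * so the whole tuple space is shifted towards the start of the page and
 * every lp_off gets a compensating adjustment. Returns 0, or -1 if the
 * page lacks free space.
 */
static int
toy_special_append(ToyPage *p, const void *entry, uint16_t len)
{
    uint16_t tuple_bytes;
    uint16_t i;

    assert(p->special >= p->upper);
    if (len > p->upper)
        return -1;

    tuple_bytes = p->special - p->upper;
    memmove(p->data + p->upper - len, p->data + p->upper, tuple_bytes);
    for (i = 0; i < p->nitems; i++)
        p->lp_off[i] -= len;

    p->upper -= len;
    p->special -= len;
    memcpy(p->data + p->special, entry, len);  /* newest entry sits lowest */
    return 0;
}
```

In this toy model each append costs one memmove() of the whole tuple space plus a pass over every line pointer, which is the expense mentioned above. Reusing an existing slot would just overwrite bytes inside the special area, with no shifting at all.
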
You can even do the same thing yourself for the nonce (use a fixed, known offset), with relatively modest effort. You'd need to have AM-specific knowledge (it would stack right on top of Matthias's technique), but that doesn't seem all that hard. There are plenty of remaining status bits in BTOpaque, and probably all other index AM special areas.

--
Peter Geoghegan