On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
>> <hlinnakan...@vmware.com> wrote:
>>> The imminent danger I see is if we change the logic on how the items are
>>> divided into posting lists, and end up in a situation where a master server
>>> adds an item to a page, and it just fits, but with the compression logic the
>>> standby version has, it cannot make it fit. As an escape hatch for that, we
>>> could have the WAL replay code try the compression again, with a larger max.
>>> posting list size, if it doesn't fit at first. And/or always leave something
>>> like 10 bytes of free space on every data page to make up for small
>>> differences in the logic.
>>
>> That scares the crap out of me.
>
> Likewise.  Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to include only the new items in the WAL record, not the whole modified compressed lists, and let the replay logic re-encode the posting lists. But the record would also include the cutoff points of each posting list. That way the replay code would have no freedom in how it splits the items into compressed lists; the split would be fully specified by the WAL record.

Here's a refresher for those who didn't follow the development of the new page format: The data page basically contains a list of ItemPointers. The items are compressed, to save disk space. However, to make random access faster, the items on the page are not all compressed as one big list. Instead, the big array of items is split into roughly equal chunks, and each chunk is compressed separately. The chunks are stored on the page one after another. (The chunks are called "posting lists" in the code; the struct is called GinPostingListData.)
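To illustrate why chunking helps random access, here is a minimal sketch, not the actual GIN code. It assumes each chunk keeps its first item uncompressed, uses a plain uint64_t as a stand-in for an ItemPointer, and uses hypothetical names (SketchChunk, find_chunk). A lookup can then pick the one chunk that may hold the target and decode only that chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for one compressed chunk ("posting list"). */
typedef struct
{
    uint64_t first;   /* first item of the chunk, kept uncompressed */
    size_t   nbytes;  /* size of the compressed remainder (not decoded here) */
} SketchChunk;

/*
 * Return the index of the only chunk that can contain 'target', or -1 if
 * 'target' sorts before the first chunk.  Chunks are assumed to be ordered
 * by their first item.
 */
int
find_chunk(const SketchChunk *chunks, int nchunks, uint64_t target)
{
    int lo = 0;
    int hi = nchunks - 1;
    int found = -1;

    assert(chunks != NULL || nchunks == 0);

    /* Binary search for the last chunk whose first item is <= target. */
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2;

        if (chunks[mid].first <= target)
        {
            found = mid;
            lo = mid + 1;
        }
        else
            hi = mid - 1;
    }
    return found;
}
```

Only the chunk returned by find_chunk would need to be decompressed to answer the lookup; with one big compressed list, everything up to the target would have to be decoded.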

The compression is completely deterministic (each item is stored as a varbyte-encoded delta from the previous item), but there are no hard rules on how the items on the page ought to be divided into the posting lists. Currently, the code tries to maintain a max size of 256 bytes per list, but it will cope with any size it finds on disk. This is where the danger lies: if we include just the new items in the WAL record, we could end up with a different physical page after WAL replay, because the replay might decide to split the items into posting lists differently than was originally done. (As the code stands, it would always make the same decision, completely deterministically, but that might change in a minor version if we're not careful.)
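To make the distinction concrete, here is a rough sketch under simplifying assumptions. Items are plain uint64_t values instead of ItemPointers, every list restarts its deltas from zero, and the greedy splitting policy is invented for illustration; the function names are hypothetical, not the real GIN ones. The encoding is fixed by the input, while the split depends on a tunable limit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store 'val' as a varbyte: 7 data bits per byte, high bit set on every
 * byte except the last.  Returns the encoded length; if 'out' is NULL,
 * only the length is computed.
 */
size_t
varbyte_encode(uint64_t val, unsigned char *out)
{
    size_t n = 0;

    while (val >= 0x80)
    {
        if (out)
            out[n] = (unsigned char) ((val & 0x7F) | 0x80);
        n++;
        val >>= 7;
    }
    if (out)
        out[n] = (unsigned char) val;
    return n + 1;
}

/* Encode sorted items as deltas from the previous item (the first from 0). */
size_t
encode_deltas(const uint64_t *items, size_t nitems, unsigned char *out)
{
    uint64_t prev = 0;
    size_t   len = 0;

    for (size_t i = 0; i < nitems; i++)
    {
        assert(items[i] >= prev);   /* input must be sorted */
        len += varbyte_encode(items[i] - prev, out + len);
        prev = items[i];
    }
    return len;
}

/*
 * Hypothetical greedy policy: start a new list whenever the next item would
 * push the current list past 'max_bytes'.  counts[k] receives the number of
 * items in list k; the return value is the number of lists.
 */
size_t
split_into_lists(const uint64_t *items, size_t nitems,
                 size_t max_bytes, size_t *counts)
{
    size_t   nlists = 0;
    size_t   used = 0;
    uint64_t prev = 0;

    for (size_t i = 0; i < nitems; i++)
    {
        size_t need = varbyte_encode(items[i] - prev, NULL);

        if (nlists == 0 || used + need > max_bytes)
        {
            /* New list: its first item is encoded relative to zero. */
            counts[nlists++] = 0;
            used = 0;
            need = varbyte_encode(items[i], NULL);
        }
        counts[nlists - 1]++;
        used += need;
        prev = items[i];
    }
    return nlists;
}
```

With items 1..6, a limit of 4 bytes yields lists of 4 and 2 items, while a limit of 3 bytes yields 3 and 3. Both layouts hold the same items, yet the pages differ physically, which is exactly the kind of divergence between master and standby to avoid.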

We can tie WAL replay's hands about that if the WAL record includes a list of the items that form the posting lists. That adds some bloat, compared to only including the new items, but not too much. (And we still only need to do that for the posting lists following the first modified one.)
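As a rough sketch of that idea (the record layout and all names here are hypothetical, not a proposed WAL format), replay could merge the new items into the existing ones and then apply the list boundaries dictated by the record, instead of running its own splitting policy. For simplicity this version carries an item count for every list on the page, not just the ones after the first modified list, and again uses uint64_t in place of ItemPointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload of the WAL record. */
typedef struct
{
    const uint64_t *new_items;    /* sorted items added by the insert */
    size_t          nnew;
    const size_t   *list_counts;  /* items per posting list after the insert */
    size_t          nlists;
} SketchWalRecord;

/*
 * Replay an insert: merge the new items into the old ones, then cut the
 * result into lists exactly as the record says.  list_first[k] receives the
 * first item of list k.  Returns the total item count, or 0 if the cutoffs
 * in the record do not match the merged items.
 */
size_t
replay_insert(const SketchWalRecord *rec,
              const uint64_t *old_items, size_t nold,
              uint64_t *merged, uint64_t *list_first)
{
    size_t i = 0;
    size_t j = 0;
    size_t n = 0;
    size_t pos = 0;

    assert(rec != NULL && merged != NULL && list_first != NULL);

    /* Standard merge of two sorted arrays. */
    while (i < nold && j < rec->nnew)
        merged[n++] = (old_items[i] <= rec->new_items[j])
            ? old_items[i++]
            : rec->new_items[j++];
    while (i < nold)
        merged[n++] = old_items[i++];
    while (j < rec->nnew)
        merged[n++] = rec->new_items[j++];

    /* Apply the cutoffs from the record; replay makes no choices here. */
    for (size_t k = 0; k < rec->nlists; k++)
    {
        if (rec->list_counts[k] == 0 || pos + rec->list_counts[k] > n)
            return 0;
        list_first[k] = merged[pos];
        pos += rec->list_counts[k];
    }
    return (pos == n) ? n : 0;
}
```

Because the boundaries come from the record, a standby running a minor version with a different preferred list size would still rebuild the same lists as the master. Each list would then be re-encoded with the deterministic delta compression.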

Alexander, would you like to give that a shot, or will I?

- Heikki

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)