On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas wrote:
>>> The imminent danger I see is if we change the logic on how the items
>>> are divided into posting lists, and end up in a situation where a
>>> master server adds an item to a page, and it just fits, but with the
>>> compression logic the standby version has, it cannot make it fit. As
>>> an escape hatch for that, we could have the WAL replay code try the
>>> compression again, with a larger max. posting list size, if it
>>> doesn't fit at first. And/or always leave something like 10 bytes of
>>> free space on every data page to make up for small differences in
>>> the logic.
>>
>> That scares the crap out of me.
>
> Likewise. Saving some WAL space is *not* worth this kind of risk.
One fairly good compromise would be to include only the new items, not
the whole modified posting lists, and let the replay logic do the
re-encoding of the posting lists. But also include the cutoff points of
each posting list in the WAL record. That way the replay code would have
no freedom in how it decides to split the items into posting lists;
that would be fully specified by the WAL record.
Here's a refresher for those who didn't follow the development of the
new page format: the data page basically contains a list of
ItemPointers. The items are compressed to save disk space. However, to
make random access faster, the items on the page are not all compressed
as one big list. Instead, the big array of items is split into roughly
equal chunks, and each chunk is compressed separately. The chunks are
stored on the page one after another. (The chunks are called "posting
lists" in the code; the struct is GinPostingListData.)
The compression is completely deterministic (each item is stored as a
varbyte-encoded delta from the previous item), but there are no hard
rules on how the items on the page ought to be divided into the posting
lists. Currently, the code tries to maintain a max size of 256 bytes per
list, but it will cope with any size it finds on disk. This is where
the danger lies: if we just include the new items in the WAL record, we
could end up with a different physical page after WAL replay, because
the replay code might decide to split the items into posting lists
differently than was originally done. (As the code stands, it would
always make the same decision, completely deterministically, but that
might change in a minor version if we're not careful.)
We can tie WAL replay's hands about that, if we include in the WAL
record a list of the items that form each posting list. That adds some
bloat compared to only including the new items, but not too much. (And
we still only need to do that for posting lists following the first
modified one.)
Alexander, would you like to give that a shot, or will I?
Sent via pgsql-hackers mailing list (email@example.com)