On 03/17/2014 04:33 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakan...@vmware.com> writes:
>> 2. Instead of storing the new compressed posting list in the WAL record,
>> store only the new item pointers added to the page. WAL replay would
>> then have to duplicate the work done in the main insertion code path:
>> find the right posting lists to insert to, decode them, add the new
>> items, and re-encode.
>
> That sounds fairly dangerous ... is any user-defined code involved in
> those decisions?

No.
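
For illustration, here is a rough sketch of the kind of mechanical work
replay would have to do, using a toy delta+varbyte encoding of abstract
64-bit item pointers (this is not the actual GIN segment format or code):

    #include <stdint.h>

    #define MAX_ITEMS 1024          /* toy fixed-size buffers */

    /* Decode a delta+varbyte buffer into 'out'; returns the item count. */
    static int
    decode_items(const uint8_t *buf, int len, uint64_t *out)
    {
        uint64_t val = 0, prev = 0;
        int      shift = 0, n = 0;

        for (int i = 0; i < len; i++)
        {
            val |= (uint64_t) (buf[i] & 0x7F) << shift;
            if (buf[i] & 0x80)
                shift += 7;         /* continuation bit: more bytes follow */
            else
            {
                prev += val;        /* undo the delta encoding */
                out[n++] = prev;
                val = 0;
                shift = 0;
            }
        }
        return n;
    }

    /* Encode a sorted item array as delta+varbyte; returns encoded length. */
    static int
    encode_items(const uint64_t *items, int n, uint8_t *buf)
    {
        uint64_t prev = 0;
        int      len = 0;

        for (int i = 0; i < n; i++)
        {
            uint64_t delta = items[i] - prev;

            prev = items[i];
            while (delta >= 0x80)
            {
                buf[len++] = (uint8_t) (delta & 0x7F) | 0x80;
                delta >>= 7;
            }
            buf[len++] = (uint8_t) delta;
        }
        return len;
    }

    /* Replay: decode, sorted-merge in the new items, re-encode. */
    static int
    replay_insert(const uint8_t *seg, int seglen,
                  const uint64_t *newitems, int nnew, uint8_t *outseg)
    {
        uint64_t old[MAX_ITEMS], merged[MAX_ITEMS];
        int      nold = decode_items(seg, seglen, old);
        int      i = 0, j = 0, n = 0;

        while (i < nold || j < nnew)
        {
            if (j >= nnew || (i < nold && old[i] < newitems[j]))
                merged[n++] = old[i++];
            else
                merged[n++] = newitems[j++];
        }
        return encode_items(merged, n, outseg);
    }

Nothing in that path calls out to user-defined code; it is pure decoding
and re-encoding.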

>> This record format would be higher-level, in the sense that we would not
>> store the physical copy of the compressed posting list as it was formed
>> originally. The same work would be done at WAL replay. As the code
>> stands, it will produce exactly the same result, but that's not
>> guaranteed if we make bugfixes to the code later, and a master and
>> standby are running different minor versions. There's not necessarily
>> anything wrong with that, but it's something to keep in mind.
>
> Version skew would be a hazard too, all right.  I think it's important
> that WAL replay be a pretty mechanical, predictable process.

Yeah. One particular point to note is that if in one place we do the more "high-level" thing and have WAL replay re-encode the page as it sees fit, then we can *not* rely on the page being byte-by-byte identical in other places either, for example in vacuum, where items are deleted.
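
To make the distinction concrete, compare two hypothetical delete-record
layouts (neither is the actual GIN WAL format). The byte-oriented one is
only correct if the standby's page is byte-for-byte identical to the
master's; the item-oriented one leaves replay free to re-encode:

    #include <stdint.h>

    /* Physical style: "overwrite these bytes at this offset".  Replaying
     * this is only correct if the standby page has exactly the layout
     * the master page had when the record was written. */
    typedef struct
    {
        uint16_t offset;    /* byte offset on the page */
        uint16_t len;       /* number of replacement bytes that follow */
    } XlPhysicalUpdate;

    /* Logical style: "delete this item".  Replay looks the item up and
     * re-encodes the posting list, so small layout differences between
     * master and standby don't matter. */
    typedef struct
    {
        uint32_t blkno;     /* block number part of the item pointer */
        uint16_t offnum;    /* offset number part of the item pointer */
    } XlLogicalDelete;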

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the page, instead of making a physical copy of the modified parts. And _bt_restore_page even inserts the items physically in a different order than the normal codepath does (see the sketch below). So for good or bad, there is some precedent for this.
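
For reference, that restore loop is roughly this shape (paraphrased from
memory of _bt_restore_page in src/backend/access/nbtree/nbtxlog.c;
consult the source for the real thing). Because every tuple is added at
FirstOffsetNumber, the physical order on the rebuilt page ends up the
reverse of the order in the WAL record:

    /* Paraphrased sketch of _bt_restore_page(); not a verbatim copy. */
    static void
    restore_page_sketch(Page page, char *from, int len)
    {
        char   *end = from + len;

        while (from < end)
        {
            IndexTupleData itupdata;
            Size           itemsz;

            memcpy(&itupdata, from, sizeof(IndexTupleData));
            itemsz = MAXALIGN(IndexTupleDSize(itupdata));

            /* Always insert at FirstOffsetNumber: the page is rebuilt
             * in reverse physical order vs. the normal codepath. */
            if (PageAddItem(page, (Item) from, itemsz, FirstOffsetNumber,
                            false, false) == InvalidOffsetNumber)
                elog(PANIC, "cannot add item to page");

            from += itemsz;
        }
    }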

The immediate danger I see is that we change the logic for how the items are divided into posting lists, and end up in a situation where the master adds an item to a page and it just fits, but with the compression logic the standby version has, it cannot be made to fit. As an escape hatch for that, we could have the WAL replay code retry the compression with a larger max. posting list size if it doesn't fit at first. And/or always leave something like 10 bytes of free space on every data page to make up for small differences in the logic.
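
A sketch of that escape hatch; encode_segment, the constants, and all the
names here are placeholders for illustration, not actual GIN functions:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SEG_MAX_SIZE  256   /* hypothetical normal segment size cap */
    #define REPLAY_SLOP    10   /* hypothetical extra headroom on replay */

    /* Hypothetical encoder: packs the items into a segment of at most
     * 'cap' bytes; returns the encoded size, or 0 if they don't fit. */
    extern size_t encode_segment(const uint64_t *items, int nitems,
                                 size_t cap, uint8_t *out);

    /* Replay-side insertion: if the standby's compression packs slightly
     * worse than the master's did, loosen the cap instead of failing. */
    static bool
    replay_encode(const uint64_t *items, int nitems,
                  uint8_t *out, size_t free_space)
    {
        size_t sz = encode_segment(items, nitems, SEG_MAX_SIZE, out);

        if (sz == 0 || sz > free_space)
            sz = encode_segment(items, nitems,
                                SEG_MAX_SIZE + REPLAY_SLOP, out);

        return (sz != 0 && sz <= free_space);
    }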

- Heikki

