On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
>> <hlinnakan...@vmware.com> wrote:
>>> The imminent danger I see is if we change the logic on how the items are
>>> divided into posting lists, and end up in a situation where a master server
>>> adds an item to a page, and it just fits, but with the compression logic the
>>> standby version has, it cannot make it fit. As an escape hatch for that, we
>>> could have the WAL replay code try the compression again, with a larger max.
>>> posting list size, if it doesn't fit at first. And/or always leave something
>>> like 10 bytes of free space on every data page to make up for small
>>> differences in the logic.
>>
>> That scares the crap out of me.
>
> Likewise.  Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to include only the new items in the WAL record, not the whole modified posting lists, and let the replay logic re-encode the posting lists. But also include the cutoff points of each posting list in the WAL record. That way the replay code would have no freedom in how it splits the items into posting lists; that would be fully specified by the WAL record.

Here's a refresher for those who didn't follow the development of the new page format: The data page basically contains a list of ItemPointers. The items are compressed, to save disk space. However, to make random access faster, the items on the page are not all compressed as one big list. Instead, the big array of items is split into roughly equal chunks, and each chunk is compressed separately. The chunks are stored on the page one after another. (The chunks are called "posting lists" in the code; the struct is called GinPostingListData.)

The compression is completely deterministic (each item is stored as a varbyte-encoded delta from the previous item), but there are no hard rules on how the items on the page ought to be divided into the posting lists. Currently, the code tries to maintain a max size of 256 bytes per list, but it will cope with any size it finds on disk. This is where the danger lies: we could end up with a different physical page after WAL replay, if we just include the new items in the WAL record. The WAL replay might decide to split the items into posting lists differently than was originally done. (As the code stands, it would always make the same decision, completely deterministically, but that might change in a minor version if we're not careful.)

We can tie WAL replay's hands about that, if we include a list of the items that form the posting lists in the WAL record. That adds some bloat, compared to only including the new items, but not too much. (And we still only need to do that for the posting lists following the first modified one.)

Alexander, would you like to give that a shot, or will I?

- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers