Some comments, aside from the design wrt. bounding boxes etc. :

On 06/15/2014 05:34 AM, Alvaro Herrera wrote:
Robert Haas wrote:
On Wed, Sep 25, 2013 at 4:34 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
Here's an updated version of this patch, with fixes to all the bugs
reported so far.  Thanks to Thom Brown, Jaime Casanova, Erik Rijkers and
Amit Kapila for the reports.

I'm not very happy with the use of a separate relation fork for
storing this data.

Here's a new version of this patch.  Now the revmap is not stored in a
separate fork, but together with all the regular data, as explained
elsewhere in the thread.

Thanks! Please update the README accordingly.

If I understand the code correctly, the revmap is a three-level deep structure. The bottom level consists of "regular revmap pages", and each regular revmap page is filled with ItemPointerDatas, which point to the index tuples. The middle level consists of "array revmap pages", and each array revmap page contains an array of BlockNumbers of the "regular revmap" pages. The top level is an array of BlockNumbers of the array revmap pages, and it is stored in the metapage.

With 8k block size, that's just enough to cover the full range of 2^32-1 blocks that you'll need if you set mm_pages_per_range=1. Each regular revmap page can store about 8192/6 = 1365 item pointers, each array revmap page can store about 8192/4 = 2048 block references, and the size of the top array is 8192/4. That's just enough; to store the required number of array pages in the top array, the array needs to be (2^32/1365)/2048)=1536 elements large.

But with 4k or smaller blocks, it's not enough.

I wonder if it would be simpler to just always store the revmap pages in the beginning of the index, before any other pages. Finding the revmap page would then be just as easy as with a separate fork. When the table/index is extended so that a new revmap page is needed, move the existing page at that block out of the way. Locking needs some consideration, but I think it would be feasible and simpler than you have now.

I have followed the suggestion by Amit to overwrite the index tuple when
a new heap tuple is inserted, instead of creating a separate index
tuple.  This saves a lot of index bloat.  This required a new entry
point in bufpage.c, PageOverwriteItemData().  bufpage.c also has a new
function PageIndexDeleteNoCompact which is similar in spirit to
PageIndexMultiDelete except that item pointers do not change.  This is
necessary because the revmap stores item pointers, and such reference
would break if we were to renumber items in index pages.

ISTM that when the old tuple cannot be updated in-place, the new index tuple is inserted with mm_doinsert(), but the old tuple is never deleted.

- Heikki

--
- Heikki


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to