On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera
<alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now if
>> you deleted, say, 100 tuples per page, you would end up with one
>> index vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.

Yes, that seems pretty clear.  The indexes can be arbitrarily large
and there can be arbitrarily many of them, so we could save a LOT
of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to
lunch.[1]  I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a
bitmap) and leave space for a TID count; then, write the TIDs
afterwards.  Continue doing this until one of the following things
happens: (a) we reach the end of the 1GB chunk - if that happens,
restart step 3 for the next chunk; (b) we fill up the slab - see step
4; or (c) the space used for the chunk's TIDs exceeds the space a
bitmap for that chunk would need - see step 5.
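A sketch of the data layout steps 1-3 describe (the names here are
mine, not from any patch; TID stands in for ItemPointerData):

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { uint32_t blkno; uint16_t offnum; } TID; /* ~ItemPointerData */

#define CHUNK_SIZE  (1024L * 1024 * 1024)   /* 1GB of heap per chunk */
#define SLAB_SIZE   (64L * 1024 * 1024)     /* step 2: 64MB TID slabs */

/* Per-chunk header written into the slab at the chunk's start (step 3). */
typedef enum { DT_TIDLIST, DT_BITMAP } DeadTupleFormat;

typedef struct
{
    DeadTupleFormat format;  /* starts as a TID list; may become a bitmap */
    uint32_t    ntids;       /* filled in as TIDs are appended */
} ChunkHeader;

typedef struct
{
    char      **chunk_dir;   /* step 1: one pointer per 1GB heap chunk */
    int         nchunks;
    char       *slab;        /* step 2: current slab */
    char       *slab_free;   /* first free byte within the slab */
} DeadTupleStore;

/* Step 1's directory needs one entry per gigabyte of heap, rounded up. */
static int
chunk_count(long rel_bytes)
{
    return (int) ((rel_bytes + CHUNK_SIZE - 1) / CHUNK_SIZE);
}
```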

4. When we fill up one of the slabs allocated in step 2, allocate a
new one and move the tuples for the current 1GB chunk to the beginning
of the new slab using memmove().  This is wasteful of both CPU time
and memory, but I think it's not that bad.  The maximum possible waste
is less than 10%, and many allocators have more overhead than that.
We could reduce the waste by using, say, 256MB chunks rather than 1GB
chunks.  If no new slab can be allocated because maintenance_work_mem
is completely exhausted (or the remaining space isn't enough for the
TIDs that would need to be moved immediately), then stop and do an
index vacuum cycle.
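Step 4's overflow handling might look roughly like this (hypothetical
names, error handling elided).  One way to see the under-10% bound: a
chunk's TID list is normally converted to a bitmap once it passes
~4.5MB (step 5), so the abandoned tail of a 64MB slab stays well under
10% of the slab:

```c
#include <string.h>
#include <stdlib.h>

/* Called when appending a TID would run off the end of the current slab.
 * Moves the in-progress chunk's TIDs to the start of a fresh slab (step 4).
 * Returns the new write position, or NULL if maintenance_work_mem is
 * exhausted, in which case the caller stops and runs an index vacuum
 * cycle instead. */
static char *
overflow_to_new_slab(char *chunk_start, char *slab_free, size_t slab_size,
                     size_t *mem_left)
{
    size_t used = slab_free - chunk_start;  /* current chunk's bytes so far */

    if (*mem_left < slab_size || used > slab_size)
        return NULL;                        /* can't fit: do an index pass */

    char *new_slab = malloc(slab_size);
    if (new_slab == NULL)
        return NULL;
    *mem_left -= slab_size;

    memmove(new_slab, chunk_start, used);   /* relocate the partial chunk */
    return new_slab + used;                 /* continue appending here */
}
```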

5. When we write a large enough number of TIDs for a 1GB chunk that
the bitmap would be smaller, check whether sufficient
maintenance_work_mem remains to allocate a bitmap for that chunk
(~4.5MB).  If not, never
mind; continue with step 3 as if the bitmap representation did not
exist.  If so, allocate space for a bitmap, move all of the TIDs for
the current chunk into it, and update the array allocated in step 1 to
point to it.  Then, finish scanning the current 1GB chunk, updating
that bitmap rather than inserting TIDs into the slab.  Rewind our
pointer into the slab to where it was at the beginning of the current
1GB chunk, so that the memory we consumed for TIDs can be reused now
that those TIDs have been transferred to a bitmap.  If, earlier in the
current 1GB chunk, we did a memmove-to-next-slab operation as
described in step 4, this "rewind" might move our pointer back into
the previous slab, in which case we can free the now-empty slab.  (The
next 1GB segment might have few enough TIDs that they will fit into
the leftover space in the previous slab.)
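In code, step 5's conversion might look like the following (again
hypothetical names; the ~4.5MB figure falls out of 131,072 pages per
1GB chunk times 291 bits per page, rounded up to bytes):

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGES_PER_CHUNK  131072          /* 1GB / 8KB */
#define BITS_PER_PAGE    291             /* MaxHeapTuplesPerPage at 8KB */
#define BITMAP_BYTES     ((PAGES_PER_CHUNK * (long) BITS_PER_PAGE + 7) / 8)

typedef struct { uint32_t blkno; uint16_t offnum; } TID;

/* Convert one chunk's TID list to a bitmap (step 5).  blkno is relative
 * to the chunk's first block; offnum is 1-based, as in ItemPointerData. */
static unsigned char *
tidlist_to_bitmap(const TID *tids, long ntids)
{
    unsigned char *bitmap = calloc(BITMAP_BYTES, 1);

    if (bitmap == NULL)
        return NULL;    /* not enough m_w_m: keep the TID list instead */

    for (long i = 0; i < ntids; i++)
    {
        long bit = (long) tids[i].blkno * BITS_PER_PAGE
            + (tids[i].offnum - 1);

        bitmap[bit / 8] |= 1 << (bit % 8);
    }
    return bitmap;
}
```

After a successful conversion, the slab write pointer rewinds to the
chunk's start, as described above, so the list's space is reclaimed.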

With this algorithm, we never exceed maintenance_work_mem, not even
transiently.  When memory is no longer sufficient to convert to the
bitmap representation without bursting above maintenance_work_mem, we
simply don't perform the conversion.  Also, we do very little memory
copying.  An alternative I considered was to do a separate allocation
for each 1GB chunk rather than carving the dead-tuple space out of
slabs.  But the problem with that is that you'll have to start those
out small (in case you don't find many dead tuples) and then grow
them, which means reallocating, which is bad both because it can burst
above maintenance_work_mem while the repalloc is in process and also
because you have to keep copying the data from the old chunk to the
new, bigger chunk.  This algorithm only needs to copy TIDs when it
runs off the end of a chunk, and that can't happen more than once
every dozen or so chunks; in contrast, progressively growing the TID
arrays for a given 1GB chunk would potentially memcpy() multiple times
per 1GB chunk, and if you used power-of-two reallocation as we
normally do the waste would be much more than what step 4 of this
algorithm leaves on the table.

There are, nevertheless, corner cases where this can lose: if you had
a number of TIDs that were going to just BARELY fit within
maintenance_work_mem, and if the bitmap representation never wins for
you, the additional array allocated in step 1 and the end-of-slab
wastage in step 4 could push you over the line.  We might be able to
tweak things here or there to reduce the potential for that, but it's
pretty well unavoidable; the flat array we're using right now has
exactly zero allocator overhead, and anything more complex will have
at least a little.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

[1] Or am I always out to lunch?

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)