On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> The problem with this is that we allocate the entire amount of
>> maintenance_work_mem even when the number of actual dead tuples turns
>> out to be very small.  That's not so bad if the amount of memory we're
>> potentially wasting is limited to ~1 GB, but it seems pretty dangerous
>> to remove the 1 GB limit, because somebody might have
>> maintenance_work_mem set to tens or hundreds of gigabytes to speed
>> index creation, and allocating that much space for a VACUUM that
>> encounters 1 dead tuple does not seem like a good plan.
>> What I think we need to do is make some provision to initially
>> allocate only a small amount of memory and then grow the allocation
>> later if needed.  For example, instead of having
>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>> ItemPointer * and allocate the array progressively in segments.  I'd
>> actually argue that the segment size should be substantially smaller
>> than 1 GB, like say 64MB; there are still some people running systems
>> which are small enough that allocating 1 GB when we may need only 6
>> bytes can drive the system into OOM.
> This would however incur the cost of having to copy the whole GB-sized
> chunk every time it's expanded. It wouldn't be cheap.

No, I don't want to end up copying the whole array; that's what I
meant by allocating it progressively in segments.  Something like what
you go on to propose.

> I've monitored the vacuum as it runs and the OS doesn't map the whole
> block unless it's touched, which it isn't until dead tuples are found.
> Surely, if overcommit is disabled (as it should), it could exhaust the
> virtual address space if set very high, but it wouldn't really use the
> memory unless it's needed, it would merely reserve it.

Yeah, but I've seen actual breakage from exactly this issue on
customer systems even with the 1GB limit, and when we start allowing
100GB it's going to get a whole lot worse.

> To fix that, rather than repalloc the whole thing, dead_tuples would
> have to be an ItemPointer** of sorted chunks. That'd be a
> significantly more complex patch, but at least it wouldn't incur the
> memcpy.

Right, this is what I had in mind.  I don't think this is actually
very complicated, because the way we use this array is really simple.
We basically just keep appending to the array until we run out of
space, and that's not very hard to implement with an array-of-arrays.
The chunks are, in some sense, sorted, as you say, but you don't need
to do qsort() or anything like that.  You're just replacing a single
flat array with a data structure that can be grown incrementally in
fixed-size chunks.

> I could attempt that, but I don't see the difference between
> vacuum and create index in this case. Both could allocate a huge chunk
> of the virtual address space if maintenance_work_mem says so, both
> proportional to the size of the table. I can't see how that could take
> any DBA by surprise.

Really?  CREATE INDEX isn't going to allocate more storage space than
the size of the data actually being sorted, because tuplesort.c is
smart about that kind of thing.  But VACUUM will very happily allocate
vastly more memory than the number of dead tuples.  It is thankfully
smart enough not to allocate more storage than the number of line
pointers that could theoretically exist in a relation of the given
size, but that only helps for very small relations.  In a large
relation that divergence between the amount of storage space that
could theoretically be needed and the amount that is actually needed
is likely to be extremely high.  1 TB relation = 2^27 blocks, each of
which can contain MaxHeapTuplesPerPage dead line pointers.  On my
system, MaxHeapTuplesPerPage is 291, so that's 291 * 2^27 possible
dead line pointers, which at 6 bytes each is 291 * 6 * 2^27 = ~218GB,
but the expected number of dead line pointers is much less than that.
Even if this is a vacuum triggered by autovacuum_vacuum_scale_factor
and you're using the default of 0.2 (probably too high for such a
large table), assuming there are about 60 tuples per page (which is
what I get with pgbench -i) the table would have about 2^27 * 60 = ~8
billion tuples, of which 1.6 billion would be dead, meaning we need
about 9-10GB of space to store all of those dead tuples.  Allocating
as much as 218GB when we need 9-10GB is going to sting, and I don't
see how you will get a comparable distortion with CREATE INDEX.  I
might be missing something, though.

There's no real issue when there's only one process running on the
system at a time.  If the user set maintenance_work_mem to an amount
of memory that he can't afford to pay even once, then that's simple
misconfiguration and it's not really our problem.  The issue arises
when there are three or more VACUUM processes running, plus a CREATE
INDEX or two, at the same time.  If you set maintenance_work_mem
to a value that is large enough to make the CREATE INDEX run fast, now
with your patch that is also going to cause each VACUUM process to
gobble up lots of extra memory that it probably doesn't need, and now
you may well start to get failures.  I've seen this happen even with
the current 1GB limit, though you need a pretty small system - e.g.
8GB RAM - for it to be a problem.  I think it is really really likely
to cause big problems for us if we dramatically increase that limit
without making the allocation algorithm smarter.

Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
