On Tue, Sep 6, 2016 at 10:28 PM, Claudio Freire <klaussfre...@gmail.com> wrote:
>> The problem with this is that we allocate the entire amount of
>> maintenance_work_mem even when the number of actual dead tuples
>> turns out to be very small. That's not so bad if the amount of
>> memory we're potentially wasting is limited to ~1 GB, but it seems
>> pretty dangerous to remove the 1 GB limit, because somebody might
>> have maintenance_work_mem set to tens or hundreds of gigabytes to
>> speed index creation, and allocating that much space for a VACUUM
>> that encounters 1 dead tuple does not seem like a good plan.
>>
>> What I think we need to do is make some provision to initially
>> allocate only a small amount of memory and then grow the allocation
>> later if needed. For example, instead of having
>> vacrelstats->dead_tuples be declared as ItemPointer, declare it as
>> ItemPointer * and allocate the array progressively in segments. I'd
>> actually argue that the segment size should be substantially
>> smaller than 1 GB, say 64 MB; there are still some people running
>> systems which are small enough that allocating 1 GB when we may
>> need only 6 bytes can drive the system into OOM.
>
> This would, however, incur the cost of having to copy the whole
> GB-sized chunk every time it's expanded. It wouldn't be cheap.
No, I don't want to end up copying the whole array; that's what I meant
by allocating it progressively in segments. Something like what you go
on to propose.

> I've monitored the vacuum as it runs and the OS doesn't map the
> whole block unless it's touched, which it isn't until dead tuples
> are found. Surely, if overcommit is disabled (as it should be), it
> could exhaust the virtual address space if set very high, but it
> wouldn't really use the memory unless it's needed; it would merely
> reserve it.

Yeah, but I've seen actual breakage from exactly this issue on
customer systems even with the 1 GB limit, and when we start allowing
100 GB it's going to get a whole lot worse.

> To fix that, rather than repalloc the whole thing, dead_tuples would
> have to be an ItemPointer** of sorted chunks. That'd be a
> significantly more complex patch, but at least it wouldn't incur the
> memcpy.

Right, this is what I had in mind. I don't think this is actually very
complicated, because the way we use this array is really simple. We
basically just keep appending to the array until we run out of space,
and that's not very hard to implement with an array-of-arrays. The
chunks are, in some sense, sorted, as you say, but you don't need to
do qsort() or anything like that. You're just replacing a single flat
array with a data structure that can be grown incrementally in
fixed-size chunks.

> I could attempt that, but I don't see the difference between vacuum
> and create index in this case. Both could allocate a huge chunk of
> the virtual address space if maintenance work mem says so, both
> proportional to the size of the table. I can't see how that could
> take any DBA by surprise.

Really? CREATE INDEX isn't going to allocate more storage space than
the size of the data actually being sorted, because tuplesort.c is
smart about that kind of thing. But VACUUM will very happily allocate
vastly more memory than the number of dead tuples.
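The append-only array-of-arrays could look something like the sketch
below. This is purely illustrative, not the actual vacuumlazy.c change:
the names DeadTupleStore, SEG_TUPLES, and record_dead_tuple are
hypothetical, and plain malloc/realloc stand in for the backend's
palloc/repalloc. The point is that growth only reallocates the small
directory of segment pointers; filled segments are never copied.

```c
#include <stdlib.h>

/* Stand-in for PostgreSQL's 6-byte ItemPointerData. */
typedef struct
{
    unsigned short bi_hi;
    unsigned short bi_lo;
    unsigned short offset;
} ItemPointerData;

/* 64 MB segments, per the suggestion quoted above. */
#define SEG_TUPLES ((64 * 1024 * 1024) / sizeof(ItemPointerData))

typedef struct
{
    ItemPointerData **segments; /* directory of fixed-size chunks */
    int         nsegments;      /* chunks allocated so far */
    size_t      ntuples;        /* dead tuples recorded in total */
} DeadTupleStore;

/*
 * Append one TID. A new 64 MB segment is allocated only when the
 * current one fills up; previously filled segments stay where they
 * are, so there is no GB-sized memcpy on growth -- only the tiny
 * pointer directory is reallocated.
 */
static void
record_dead_tuple(DeadTupleStore *store, ItemPointerData tid)
{
    size_t      slot = store->ntuples % SEG_TUPLES;

    if (slot == 0)              /* no segment yet, or current one full */
    {
        store->segments = realloc(store->segments,
                                  (store->nsegments + 1) * sizeof(ItemPointerData *));
        store->segments[store->nsegments++] =
            malloc(SEG_TUPLES * sizeof(ItemPointerData));
    }
    store->segments[store->nsegments - 1][slot] = tid;
    store->ntuples++;
}
```

Because TIDs are found in physical scan order, each segment is
internally sorted for free, and the lookup side (lazy_tid_reaped) can
binary-search the directory and then the segment.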
VACUUM is thankfully smart enough not to allocate more storage than
the number of line pointers that could theoretically exist in a
relation of the given size, but that only helps for very small
relations. In a large relation, the divergence between the amount of
storage space that could theoretically be needed and the amount that
is actually needed is likely to be extremely high. A 1 TB relation is
2^27 blocks, each of which can contain MaxHeapTuplesPerPage dead line
pointers. On my system, MaxHeapTuplesPerPage is 291, so that's
291 * 2^27 possible dead line pointers, which at 6 bytes each is
291 * 6 * 2^27 = ~218 GB, but the expected number of dead line
pointers is much less than that. Even if this is a vacuum triggered by
autovacuum_vacuum_scale_factor and you're using the default of 0.2
(probably too high for such a large table), then assuming there are
about 60 tuples per page (which is what I get with pgbench -i) the
table would have about 2^27 * 60 = ~8 billion tuples, of which ~1.6
billion would be dead, meaning we need about 9-10 GB of space to store
all of those dead tuples. Allocating as much as 218 GB when we need
9-10 GB is going to sting, and I don't see how you would get a
comparable distortion with CREATE INDEX. I might be missing something,
though.

There's no real issue when there's only one process running on the
system at a time. If the user set maintenance_work_mem to an amount of
memory that he can't afford to pay even once, then that's simple
misconfiguration and it's not really our problem. The issue arises
when there are 3 or more VACUUM processes running plus a CREATE INDEX
or two at the same time. If you set maintenance_work_mem to a value
that is large enough to make the CREATE INDEX run fast, then with your
patch that is also going to cause each VACUUM process to gobble up
lots of extra memory that it probably doesn't need, and now you may
well start to get failures.
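The back-of-the-envelope arithmetic above can be checked with a quick
calculation. The inputs are the figures quoted in the text: 8 KB
blocks, MaxHeapTuplesPerPage = 291 (this is build-dependent), 6-byte
TIDs, roughly 60 tuples per page as with pgbench -i, and a 0.2 scale
factor; the function names are mine, for illustration only.

```c
#define BLOCKS_1TB  (1LL << 27)     /* 1 TB / 8 KB block size */
#define MAX_LP_PAGE 291LL           /* MaxHeapTuplesPerPage on this system */
#define TID_BYTES   6LL             /* sizeof(ItemPointerData) */

/* Worst case VACUUM could reserve: every possible line pointer dead. */
static long long
worst_case_bytes(void)
{
    return BLOCKS_1TB * MAX_LP_PAGE * TID_BYTES;    /* ~218 GB */
}

/* What an autovacuum at scale factor 0.2 actually needs. */
static long long
expected_bytes(void)
{
    long long total_tuples = BLOCKS_1TB * 60;   /* ~8 billion tuples */
    long long dead_tuples = total_tuples / 5;   /* 20% dead: ~1.6 billion */
    return dead_tuples * TID_BYTES;             /* ~9 GB */
}
```

That's a 20x-plus gap between the reservation and the actual need, and
it widens further at more realistic (lower) scale factors.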
I've seen this happen even with the current 1 GB limit, though you
need a pretty small system - e.g. 8 GB of RAM - for it to be a
problem. I think it is really, really likely to cause big problems for
us if we dramatically increase that limit without making the
allocation algorithm smarter.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

-- 
Sent via pgsql-hackers mailing list (email@example.com)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers