On Wed, Sep 14, 2016 at 1:23 PM, Alvaro Herrera <alvhe...@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Actually, I think that probably *is* worthwhile, specifically because
>> it might let us avoid multiple index scans in cases where we currently
>> require them.  Right now, our default maintenance_work_mem value is
>> 64MB, which is enough to hold a little over ten million tuples.  It's
>> also large enough to hold a bitmap for a 14GB table.  So right now, if
>> you deleted, say, 100 tuples per page, you would end up with an index
>> vacuum cycle for every ~100,000 pages = 800MB, whereas switching to
>> the bitmap representation for such cases would require only one index
>> vacuum cycle for every 14GB, more than an order of magnitude
>> improvement!
>
> Yeah, this sounds worthwhile.  If we switch to the more compact
> in-memory representation close to the point where we figure the TID
> array is not going to fit in m_w_m, then we're saving some number of
> additional index scans, and I'm pretty sure that the time to transform
> from array to bitmap is going to be more than paid back by the I/O
> savings.
Yes, that seems pretty clear.  The indexes can be arbitrarily large and
there can be arbitrarily many of them, so we could save a LOT of I/O.

> One thing not quite clear to me is how do we create the bitmap
> representation starting from the array representation in midflight
> without using twice as much memory transiently.  Are we going to write
> the array to a temp file, free the array memory, then fill the bitmap by
> reading the array from disk?

I was just thinking about this exact problem while I was out to lunch.
I wonder if we could do something like this:

1. Allocate an array large enough for one pointer per gigabyte of the
underlying relation.

2. Allocate 64MB, or the remaining amount of maintenance_work_mem if
it's less, to store TIDs.

3. At the beginning of each 1GB chunk, add a pointer to the first free
byte in the slab allocated in step 2 to the array allocated in step 1.
Write a header word that identifies this as a TID list (rather than a
bitmap) and leave space for a TID count; then, write the TIDs
afterwards.  Continue doing this until one of the following things
happens: (a) we reach the end of the 1GB chunk - if that happens,
restart step 3 for the next chunk; (b) we fill the current slab - see
step 4; or (c) the space being used for the chunk's TIDs comes to
exceed the space a bitmap would need - see step 5.

4. When we fill up one of the slabs allocated in step 2, allocate a
new one and move the TIDs for the current 1GB chunk to the beginning
of the new slab using memmove().  This is wasteful of both CPU time
and memory, but I think it's not that bad.  The maximum possible waste
is less than 10%, and many allocators have more overhead than that.
We could reduce the waste by using, say, 256MB chunks rather than 1GB
chunks.
If no new slab can be allocated because maintenance_work_mem is
completely exhausted (or the remaining space isn't enough for the TIDs
that would need to be moved immediately), then stop and do an index
vacuum cycle.

5. When we write a large enough number of TIDs for a 1GB chunk that
the bitmap would be smaller, check whether sufficient
maintenance_work_mem remains to allocate a bitmap for a 1GB chunk
(~4.5MB).  If not, never mind; continue with step 3 as if the bitmap
representation did not exist.  If so, allocate space for a bitmap,
move all of the TIDs for the current chunk into it, and update the
array allocated in step 1 to point to it.  Then, finish scanning the
current 1GB chunk, updating that bitmap rather than inserting TIDs
into the slab.  Rewind our pointer into the slab to where it was at
the beginning of the current 1GB chunk, so that the memory we consumed
for TIDs can be reused now that those TIDs have been transferred to a
bitmap.  If, earlier in the current 1GB chunk, we did a
memmove-to-next-slab operation as described in step 4, this "rewind"
might move our pointer back into the previous slab, in which case we
can free the now-empty slab.  (The next 1GB segment might have few
enough TIDs that they will fit into the leftover space in the previous
slab.)

With this algorithm, we never exceed maintenance_work_mem, not even
transiently.  When memory is no longer sufficient to convert to the
bitmap representation without bursting above maintenance_work_mem, we
simply don't perform the conversion.  Also, we do very little memory
copying.

An alternative I considered was to do a separate allocation for each
1GB chunk rather than carving the dead-tuple space out of slabs.
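Before weighing that alternative, here is a rough sketch in C of the layout steps 1 through 5 describe - every type and function name here is invented for illustration, and this is the shape of the data structures rather than working vacuum code:

```c
/* Illustrative sketch of the dead-TID store from steps 1-5; invented names,
 * not PostgreSQL code.  Assumes 8KB pages and 6-byte TIDs. */
#include <stddef.h>
#include <stdint.h>

#define CHUNK_PAGES         131072   /* 8KB pages in a 1GB chunk */
#define MAX_TUPLES_PER_PAGE 291      /* MaxHeapTuplesPerPage for 8KB pages */

/* 6 bytes, mirroring ItemPointerData's three uint16 fields */
typedef struct { uint16_t bi_hi, bi_lo, offnum; } DeadTid;

enum { CHUNK_TIDLIST, CHUNK_BITMAP };

/* Step 3's header word, written at the start of each chunk's entry in a
 * slab: it identifies the representation and, for a list, the TID count. */
typedef struct
{
    uint32_t kind;                   /* CHUNK_TIDLIST or CHUNK_BITMAP */
    uint32_t ntids;                  /* filled in as TIDs are appended */
} ChunkHeader;

typedef struct
{
    ChunkHeader **chunks;            /* step 1: one pointer per 1GB of heap */
    int           nchunks;
    char         *slab;              /* step 2: current slab (64MB, or what
                                      * remains of maintenance_work_mem) */
    char         *slab_free;         /* next free byte within the slab */
} DeadTidStore;

/* Bitmap size for one 1GB chunk: one bit per possible line pointer. */
static size_t
chunk_bitmap_bytes(void)
{
    return (size_t) CHUNK_PAGES * MAX_TUPLES_PER_PAGE / 8;   /* ~4.5MB */
}

/* Step 5's trigger: switch a chunk to the bitmap representation once its
 * TID list has grown past the size of the bitmap. */
static int
tidlist_exceeds_bitmap(uint32_t ntids)
{
    return (size_t) ntids * sizeof(DeadTid) > chunk_bitmap_bytes();
}
```

Each entry in the step-1 directory points either at a ChunkHeader inside a slab (the TID-list case) or at a separately allocated ~4.5MB bitmap; the step-5 conversion fires exactly when the list would outgrow the bitmap, which with 6-byte TIDs is a bit under 800,000 dead TIDs per 1GB chunk.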
But the problem with that is that you'd have to start those
allocations out small (in case you don't find many dead tuples) and
then grow them, which means reallocating - which is bad both because
it can burst above maintenance_work_mem while the repalloc is in
progress and because you have to keep copying the data from the old
chunk to the new, bigger chunk.  This algorithm only needs to copy
TIDs when it runs off the end of a slab, and that can't happen more
than once every dozen or so chunks; in contrast, progressively growing
the TID arrays for a given 1GB chunk would potentially memcpy()
multiple times per 1GB chunk, and if you used power-of-two
reallocation, as we normally do, the waste would be much more than
what step 4 of this algorithm leaves on the table.

There are, nevertheless, corner cases where this can lose: if you had
a number of TIDs that were going to just BARELY fit within
maintenance_work_mem, and if the bitmap representation never wins for
you, the additional array allocated in step 1 and the end-of-slab
wastage in step 4 could push you over the line.  We might be able to
tweak things here or there to reduce the potential for that, but it's
pretty well unavoidable; the flat array we're using right now has
exactly zero allocator overhead, and anything more complex will have
some.

Or am I always out to lunch?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (firstname.lastname@example.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers