On Tue, 1 Sep 2020 at 05:56, Peter Geoghegan <p...@bowt.ie> wrote:
>
> On Mon, Aug 31, 2020 at 12:22 PM Thomas Munro <thomas.mu...@gmail.com> wrote:
> > On Sun, Aug 30, 2020 at 11:08 PM Masahiko Sawada
> > <masahiko.saw...@2ndquadrant.com> wrote:
> > > So my proposal is to add boundary value check in lazy_tid_reaped()
> > > before executing bsearch(3). This will help when index vacuum happens
> > > multiple times or when garbage tuples are concentrated to a narrow
> > > range.
> >
> > Makes sense if it's often out of range.
>
> I also think this is a good idea. But I also wonder if it goes far
> enough. Only one or two dead TIDs in inconvenient heap pages can
> completely defeat the optimization.
>
> A related problem with the TID list is that it is very space
> inefficient. It would be nice to fix that problem at some point. We
> could make multiple index scans by VACUUM operations much rarer if we
> tried. But how can we do all of this together?
>
> I wonder if Roaring bitmaps would work well for this:
>
> https://arxiv.org/abs/1709.07821
>
> This data structure looks like it might work well in lazy_tid_reaped()
> (for the TID array), because it offers effective bitmap compression
> without compromising on binary search speed. Of course, you'd have to
> encode TIDs as bitmaps to make use of the data structure, much like
> tidbitmap.c. I imagine that the Roaring bitmap indirection would be
> very CPU cache efficient in cases that are similar to the cases
> Sawada-san is worried about, but a bit more complicated.
>
> VACUUM would be able to make the summarizing information for the TID
> bitmap resident in CPU cache. Only this well-cached summarizing
> information (the top-level bitmap range indirection) will need to be
> accessed by most individual calls to a Roaring-bitmap-aware
> lazy_tid_reaped() that return false (i.e. calls that say "don't kill
> this TID, it's alive"). These performance characteristics seem likely
> when a certain amount of clustering of dead tuples takes place in the
> heap. I bet having completely random positions for dead TIDs almost
> never happens -- *some* clustering is practically certain to take
> place, even without natural locality in the data (which is itself very
> common). Think about how opportunistic pruning accumulates many
> LP_DEAD items in heap pages.
>
> There is a reference Roaring bitmap implementation in C, so it's
> probably not that hard to experimentally determine how well it would
> work:
>
> https://github.com/RoaringBitmap/CRoaring
>
> Lots of open source projects already use it, so there are no problems
> with patents. Seems worth investigating. (I also wonder if we could
> use it for clog.)
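For intuition, the top-level indirection Peter describes can be sketched in plain C. This is a deliberately simplified stand-in for a real Roaring bitmap, not CRoaring code: the high 16 bits of a 32-bit key select a container, the low 16 bits a bit within it, and a lookup that misses touches only the small top-level table. (A real Roaring bitmap additionally uses sorted container keys and array/run containers; toy_bitmap, toy_add, and toy_contains are hypothetical names.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define CONTAINER_BITS	(1 << 16)
#define CONTAINER_WORDS	(CONTAINER_BITS / 64)

typedef struct toy_bitmap
{
	/* one slot per possible high-16-bit prefix; NULL = empty container */
	uint64_t   *containers[1 << 16];
} toy_bitmap;

static void
toy_add(toy_bitmap *bm, uint32_t key)
{
	uint16_t	hi = key >> 16;
	uint16_t	lo = key & 0xFFFF;

	if (bm->containers[hi] == NULL)
		bm->containers[hi] = calloc(CONTAINER_WORDS, sizeof(uint64_t));
	bm->containers[hi][lo / 64] |= UINT64_C(1) << (lo % 64);
}

static bool
toy_contains(const toy_bitmap *bm, uint32_t key)
{
	uint16_t	hi = key >> 16;
	uint16_t	lo = key & 0xFFFF;

	/* most "tuple is alive" probes fall out here, touching only the
	 * well-cached top-level table */
	if (bm->containers[hi] == NULL)
		return false;
	return (bm->containers[hi][lo / 64] >> (lo % 64)) & 1;
}
```

The point of the two-level split is that a clustered set of dead TIDs populates only a few containers, so the hot path for live TIDs stays cache-resident.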
Very interesting.

Before getting into CRoaring bitmaps, I've done some experiments with three different methods of recording dead tuple TIDs: array, array-minmax, and intset.

'array' is the current method lazy vacuum uses. It just adds dead tuple TIDs to a simple array of ItemPointerData.

'array-minmax' is the same as the 'array' method except that it checks the min and max block boundaries before deleting index dead tuples (i.e., in lazy_tid_reaped()).

'intset' uses the integer set (src/backend/lib/integerset.c) to record dead tuple TIDs. It's an in-memory data structure for holding 64-bit integers.

In the experiments, I created a 4GB table and made 20% of the total tuples dead in all test cases, while changing the distribution of where the dead tuples exist within the table. The table has only one index, so I didn't use parallel vacuum. I set maintenance_work_mem high enough that the index scan runs only once.

Here are the results, showing the "total execution time in seconds (heap scan time/index vacuum time/table vacuum time/memory usage in MB)". The numbers are rounded down.

1. Updated the table evenly (every block has at least one dead tuple).

array        : 22 (8/12/2/114)
array-minmax : 20 (8/11/2/114)
intset       : 17 (7/8/1/19)

2. Updated the middle of the table.

array        : 25 (6/19/0/114)
array-minmax : 17 (6/11/0/114)
intset       : 17 (6/11/0/7)

3. Updated the tail of the table.

array        : 25 (6/19/0/114)
array-minmax : 18 (6/11/0/114)
intset       : 18 (6/11/0/5)

4. Updated both the beginning and the tail of the table.

array        : 25 (6/19/0/114)
array-minmax : 23 (6/17/0/114)
intset       : 21 (6/14/0/6)

The memory usage is calculated by (# of dead tuples * sizeof(ItemPointerData)) in the 'array' and 'array-minmax' cases, although the actual amount of palloc'd memory is different, and by intset_memory_usage() in the 'intset' case. Using the integer set is very memory efficient (5MB vs. 114MB in the base case) and there is no 1GB limitation.
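The intset method above depends on packing each heap TID into a single 64-bit integer. The following is a sketch of that packing, modeled on the itemptr_encode()/itemptr_decode() helpers from catalog/index.h that the attached patch reuses (the tid struct and function names here are illustrative, not the PostgreSQL definitions): offset numbers fit in 16 bits, so the block number goes in the upper bits, and the encoded values sort in TID order.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
	uint32_t	block;			/* plays the role of BlockNumber */
	uint16_t	offset;			/* plays the role of OffsetNumber */
} tid;

/* pack (block, offset) into one 64-bit key; preserves TID ordering */
static uint64_t
tid_encode(tid t)
{
	return ((uint64_t) t.block << 16) | t.offset;
}

/* recover the original (block, offset) pair from an encoded key */
static tid
tid_decode(uint64_t val)
{
	tid			t;

	t.block = (uint32_t) (val >> 16);
	t.offset = (uint16_t) (val & 0xFFFF);
	return t;
}
```

Because the encoding is order-preserving, integerset.c's requirement that values be added in increasing order is satisfied by vacuum's block-ordered heap scan.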
Looking at the execution time, I had expected that recording TIDs with the integer set would be slower than with the simple array, but in this experiment there is no big difference among the methods. Perhaps the results would vary if vacuum needed to record many more dead tuple TIDs.

From the results, I can see a good improvement in the integer set case, and it is probably a good start, but if we want to use it for lazy vacuum, we will need to improve it so that it can be used on DSA for parallel vacuum.

I've attached the patch I used for the experiment. It adds an xx_vacuum GUC parameter that controls the method of recording TIDs. Please note that it doesn't support parallel vacuum.

Regards,

-- 
Masahiko Sawada
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
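For reference before the patch itself: the 'array-minmax' method benchmarked above boils down to a cheap range rejection ahead of bsearch(3). A minimal self-contained sketch, using encoded 64-bit TID keys in place of PostgreSQL's ItemPointerData array (tid_is_dead is a hypothetical name, not the patch's lazy_tid_reaped):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

static int
cmp_uint64(const void *a, const void *b)
{
	uint64_t	x = *(const uint64_t *) a;
	uint64_t	y = *(const uint64_t *) b;

	return (x > y) - (x < y);
}

/*
 * 'dead' is sorted ascending, as vacuum's dead-tuple array always is.
 * Keys outside [dead[0], dead[ndead - 1]] are rejected without paying
 * for the binary search -- the whole point of the minmax variant.
 */
static bool
tid_is_dead(const uint64_t *dead, size_t ndead, uint64_t key)
{
	if (ndead == 0 || key < dead[0] || key > dead[ndead - 1])
		return false;			/* cheap boundary rejection */
	return bsearch(&key, dead, ndead, sizeof(uint64_t),
				   cmp_uint64) != NULL;
}
```

This explains why the variant helps most in cases 2 and 3 above (dead tuples confined to a narrow block range) and least in case 4, where dead tuples at both ends make the [min, max] window span nearly the whole table.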
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 4f2f38168d..304dae0b25 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -80,6 +80,9 @@
 #include "utils/pg_rusage.h"
 #include "utils/timestamp.h"
 
+#include "lib/integerset.h"
+#include "catalog/index.h"
+
 /*
  * Space/time tradeoff parameters: do these need to be user-tunable?
@@ -167,6 +170,8 @@ typedef struct LVDeadTuples
 {
 	int			max_tuples;		/* # slots allocated in array */
 	int			num_tuples;		/* current # of entries */
+	MemoryContext ctx;
+	IntegerSet *intset;
 	/* List of TIDs of tuples we intend to delete */
 	/* NB: this list is ordered by TID address */
 	ItemPointerData itemptrs[FLEXIBLE_ARRAY_MEMBER];	/* array of
@@ -338,6 +343,7 @@ static MultiXactId MultiXactCutoff;
 
 static BufferAccessStrategy vac_strategy;
 
+int			xx_vacuum = VACUUM_ARRAY;
 
 /* non-export function prototypes */
 static void lazy_scan_heap(Relation onerel, VacuumParams *params,
@@ -405,6 +411,41 @@ static void update_vacuum_error_info(LVRelStats *errinfo, LVSavedErrInfo *saved_
 							OffsetNumber offnum);
 static void restore_vacuum_error_info(LVRelStats *errinfo, const LVSavedErrInfo *saved_err_info);
 
+static inline bool
+check_memory_limit(LVDeadTuples *dead_tuples)
+{
+	if (xx_vacuum == VACUUM_ARRAY || xx_vacuum == VACUUM_ARRAY_MINMAX)
+		return ((dead_tuples->max_tuples - dead_tuples->num_tuples) < MaxHeapTuplesPerPage) &&
+			dead_tuples->num_tuples > 0;
+	else
+		return intset_memory_usage(dead_tuples->intset) >= (maintenance_work_mem * 1000L);
+}
+
+static inline int
+get_num_dead_tuples(LVDeadTuples *dead_tuples)
+{
+	if (xx_vacuum == VACUUM_ARRAY || xx_vacuum == VACUUM_ARRAY_MINMAX)
+		return dead_tuples->num_tuples;
+	else
+		return intset_num_entries(dead_tuples->intset);
+}
+
+static void
+init_dead_tuple_intset(LVDeadTuples *dead_tuples)
+{
+	MemoryContext oldctx;
+
+	if (dead_tuples->ctx)
+		MemoryContextDelete(dead_tuples->ctx);
+
+	dead_tuples->ctx = GenerationContextCreate(CurrentMemoryContext,
+											   "dead tuple intset",
+											   16 * 1024);
+	oldctx = MemoryContextSwitchTo(dead_tuples->ctx);
+	dead_tuples->intset = intset_create();
+	MemoryContextSwitchTo(oldctx);
+}
+
 /*
  * heap_vacuum_rel() -- perform VACUUM for one heap relation
@@ -1036,8 +1077,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		 * If we are close to overrunning the available space for dead-tuple
 		 * TIDs, pause and do a cycle of vacuuming before we tackle this page.
 		 */
-		if ((dead_tuples->max_tuples - dead_tuples->num_tuples) < MaxHeapTuplesPerPage &&
-			dead_tuples->num_tuples > 0)
+		if (check_memory_limit(vacrelstats->dead_tuples))
 		{
 			/*
 			 * Before beginning index vacuuming, we release any pin we may
@@ -1064,6 +1104,8 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 			 * valid.
 			 */
 			dead_tuples->num_tuples = 0;
+			if (xx_vacuum == VACUUM_INTSET)
+				init_dead_tuple_intset(dead_tuples);
 
 			/*
 			 * Vacuum the Free Space Map to make newly-freed space visible on
@@ -1250,7 +1292,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 		has_dead_tuples = false;
 		nfrozen = 0;
 		hastup = false;
-		prev_dead_count = dead_tuples->num_tuples;
+		prev_dead_count = get_num_dead_tuples(dead_tuples);
 		maxoff = PageGetMaxOffsetNumber(page);
 
 		/*
@@ -1702,7 +1744,7 @@ lazy_scan_heap(Relation onerel, VacuumParams *params, LVRelStats *vacrelstats,
 	/* If any tuples need to be deleted, perform final vacuum cycle */
 	/* XXX put a threshold on min number of tuples here? */
-	if (dead_tuples->num_tuples > 0)
+	if (get_num_dead_tuples(dead_tuples) > 0)
 	{
 		/* Work on all the indexes, and then the heap */
 		lazy_vacuum_all_indexes(onerel, Irel, indstats, vacrelstats,
@@ -1861,35 +1903,78 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 	npages = 0;
 
 	tupindex = 0;
-	while (tupindex < vacrelstats->dead_tuples->num_tuples)
+
+	if (xx_vacuum == VACUUM_INTSET)
 	{
-		BlockNumber tblk;
-		Buffer		buf;
-		Page		page;
-		Size		freespace;
+		uint64		val;
 
-		vacuum_delay_point();
+		intset_begin_iterate(vacrelstats->dead_tuples->intset);
 
-		tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
-		vacrelstats->blkno = tblk;
-		buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
-								 vac_strategy);
-		if (!ConditionalLockBufferForCleanup(buf))
+		while (tupindex != -1 &&
+			   intset_iterate_next(vacrelstats->dead_tuples->intset, &val))
 		{
-			ReleaseBuffer(buf);
-			++tupindex;
-			continue;
+			BlockNumber tblk;
+			Buffer		buf;
+			Page		page;
+			Size		freespace;
+			ItemPointerData itemptr;
+
+			vacuum_delay_point();
+
+			itemptr_decode(&itemptr, val);
+			tblk = ItemPointerGetBlockNumber(&itemptr);
+			vacrelstats->blkno = tblk;
+			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+									 vac_strategy);
+			if (!ConditionalLockBufferForCleanup(buf))
+			{
+				ReleaseBuffer(buf);
+				++tupindex;
+				continue;
+			}
+			tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+										&vmbuffer);
+
+			/* Now that we've compacted the page, record its available space */
+			page = BufferGetPage(buf);
+			freespace = PageGetHeapFreeSpace(page);
+
+			UnlockReleaseBuffer(buf);
+			RecordPageWithFreeSpace(onerel, tblk, freespace);
+			npages++;
 		}
-		tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
-									&vmbuffer);
+	}
+	else
+	{
+		while (tupindex < vacrelstats->dead_tuples->num_tuples)
+		{
+			BlockNumber tblk;
+			Buffer		buf;
+			Page		page;
+			Size		freespace;
 
-		/* Now that we've compacted the page, record its available space */
-		page = BufferGetPage(buf);
-		freespace = PageGetHeapFreeSpace(page);
+			vacuum_delay_point();
 
-		UnlockReleaseBuffer(buf);
-		RecordPageWithFreeSpace(onerel, tblk, freespace);
-		npages++;
+			tblk = ItemPointerGetBlockNumber(&vacrelstats->dead_tuples->itemptrs[tupindex]);
+			vacrelstats->blkno = tblk;
+			buf = ReadBufferExtended(onerel, MAIN_FORKNUM, tblk, RBM_NORMAL,
+									 vac_strategy);
+			if (!ConditionalLockBufferForCleanup(buf))
+			{
+				ReleaseBuffer(buf);
+				++tupindex;
+				continue;
+			}
+			tupindex = lazy_vacuum_page(onerel, tblk, buf, tupindex, vacrelstats,
+										&vmbuffer);
+
+			/* Now that we've compacted the page, record its available space */
+			page = BufferGetPage(buf);
+			freespace = PageGetHeapFreeSpace(page);
+
+			UnlockReleaseBuffer(buf);
+			RecordPageWithFreeSpace(onerel, tblk, freespace);
+			npages++;
+		}
 	}
 
 	/* Clear the block number information */
@@ -1901,10 +1986,19 @@ lazy_vacuum_heap(Relation onerel, LVRelStats *vacrelstats)
 		vmbuffer = InvalidBuffer;
 	}
 
+	ereport(elevel,
+			(errmsg("%s memory usage %lu",
+					xx_vacuum == VACUUM_INTSET ? "INTSET" :
+					xx_vacuum == VACUUM_ARRAY_MINMAX ? "ARRAY_MINMAX" :
+					"ARRAY",
+					xx_vacuum == VACUUM_INTSET ?
+					intset_memory_usage(vacrelstats->dead_tuples->intset)
+					: (sizeof(ItemPointerData) * vacrelstats->dead_tuples->num_tuples))));
 	ereport(elevel,
 			(errmsg("\"%s\": removed %d row versions in %d pages",
 					vacrelstats->relname,
-					tupindex, npages),
+					get_num_dead_tuples(vacrelstats->dead_tuples),
+					npages),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
@@ -1941,19 +2035,50 @@ lazy_vacuum_page(Relation onerel, BlockNumber blkno, Buffer buffer,
 
 	START_CRIT_SECTION();
 
-	for (; tupindex < dead_tuples->num_tuples; tupindex++)
+	if (xx_vacuum == VACUUM_INTSET)
 	{
-		BlockNumber tblk;
-		OffsetNumber toff;
-		ItemId		itemid;
+		uint64		val;
+		bool		nextblk = false;
+
+		while (intset_iterate_next(vacrelstats->dead_tuples->intset, &val))
+		{
+			BlockNumber tblk;
+			OffsetNumber toff;
+			ItemId		itemid;
+			ItemPointerData itemptr;
+
+			itemptr_decode(&itemptr, val);
+			tblk = ItemPointerGetBlockNumber(&itemptr);
+			if (tblk != blkno)
+			{
+				nextblk = true;
+				break;			/* past end of tuples for this block */
+			}
+			toff = ItemPointerGetOffsetNumber(&itemptr);
+			itemid = PageGetItemId(page, toff);
+			ItemIdSetUnused(itemid);
+			unused[uncnt++] = toff;
+		}
+
+		if (!nextblk)
+			tupindex = -1;
+	}
+	else
+	{
+		for (; tupindex < dead_tuples->num_tuples; tupindex++)
+		{
+			BlockNumber tblk;
+			OffsetNumber toff;
+			ItemId		itemid;
 
-		tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
-		if (tblk != blkno)
-			break;				/* past end of tuples for this block */
-		toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
-		itemid = PageGetItemId(page, toff);
-		ItemIdSetUnused(itemid);
-		unused[uncnt++] = toff;
+			tblk = ItemPointerGetBlockNumber(&dead_tuples->itemptrs[tupindex]);
+			if (tblk != blkno)
+				break;			/* past end of tuples for this block */
+			toff = ItemPointerGetOffsetNumber(&dead_tuples->itemptrs[tupindex]);
+			itemid = PageGetItemId(page, toff);
+			ItemIdSetUnused(itemid);
+			unused[uncnt++] = toff;
+		}
 	}
 
 	PageRepairFragmentation(page);
@@ -2471,7 +2596,7 @@ lazy_vacuum_index(Relation indrel, IndexBulkDeleteResult **stats,
 	ereport(elevel,
 			(errmsg(msg,
 					vacrelstats->indname,
-					dead_tuples->num_tuples),
+					get_num_dead_tuples(dead_tuples)),
 			 errdetail_internal("%s", pg_rusage_show(&ru0))));
 
 	/* Revert to the previous phase information for error traceback */
@@ -2892,13 +3017,24 @@ static void
 lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 {
 	LVDeadTuples *dead_tuples = NULL;
-	long		maxtuples;
 
-	maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
+	if (xx_vacuum == VACUUM_INTSET)
+	{
+		dead_tuples = (LVDeadTuples *) palloc(sizeof(LVDeadTuples));
+		dead_tuples->num_tuples = 0;
+		dead_tuples->ctx = NULL;
+		init_dead_tuple_intset(dead_tuples);
+	}
+	else
+	{
+		long		maxtuples;
+
+		maxtuples = compute_max_dead_tuples(relblocks, vacrelstats->useindex);
 
-	dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
-	dead_tuples->num_tuples = 0;
-	dead_tuples->max_tuples = (int) maxtuples;
+		dead_tuples = (LVDeadTuples *) palloc(SizeOfDeadTuples(maxtuples));
+		dead_tuples->num_tuples = 0;
+		dead_tuples->max_tuples = (int) maxtuples;
+	}
 
 	vacrelstats->dead_tuples = dead_tuples;
 }
@@ -2909,17 +3045,29 @@ lazy_space_alloc(LVRelStats *vacrelstats, BlockNumber relblocks)
 static void
 lazy_record_dead_tuple(LVDeadTuples *dead_tuples, ItemPointer itemptr)
 {
-	/*
-	 * The array shouldn't overflow under normal behavior, but perhaps it
-	 * could if we are given a really small maintenance_work_mem. In that
-	 * case, just forget the last few tuples (we'll get 'em next time).
-	 */
-	if (dead_tuples->num_tuples < dead_tuples->max_tuples)
+	if (xx_vacuum == VACUUM_INTSET)
+	{
+		uint64		val = itemptr_encode(itemptr);
+
+		if (intset_memory_usage(dead_tuples->intset) >= (maintenance_work_mem * 1000L))
+			return;
+
+		intset_add_member(dead_tuples->intset, val);
+	}
+	else
 	{
-		dead_tuples->itemptrs[dead_tuples->num_tuples] = *itemptr;
-		dead_tuples->num_tuples++;
-		pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
-									 dead_tuples->num_tuples);
+		/*
+		 * The array shouldn't overflow under normal behavior, but perhaps it
+		 * could if we are given a really small maintenance_work_mem. In that
+		 * case, just forget the last few tuples (we'll get 'em next time).
+		 */
+		if (dead_tuples->num_tuples < dead_tuples->max_tuples)
+		{
+			dead_tuples->itemptrs[dead_tuples->num_tuples] = *itemptr;
+			dead_tuples->num_tuples++;
+			pgstat_progress_update_param(PROGRESS_VACUUM_NUM_DEAD_TUPLES,
+										 dead_tuples->num_tuples);
+		}
 	}
 }
@@ -2934,15 +3082,40 @@ static bool
 lazy_tid_reaped(ItemPointer itemptr, void *state)
 {
 	LVDeadTuples *dead_tuples = (LVDeadTuples *) state;
-	ItemPointer res;
 
-	res = (ItemPointer) bsearch((void *) itemptr,
+	if (xx_vacuum == VACUUM_INTSET)
+		return intset_is_member(dead_tuples->intset,
+								itemptr_encode(itemptr));
+	else if (xx_vacuum == VACUUM_ARRAY_MINMAX)
+	{
+		ItemPointer res;
+		BlockNumber minblk, maxblk, blk;
+
+		blk = ItemPointerGetBlockNumber(itemptr);
+		minblk = ItemPointerGetBlockNumber(&(dead_tuples->itemptrs[0]));
+		maxblk = ItemPointerGetBlockNumber(&(dead_tuples->itemptrs[dead_tuples->num_tuples - 1]));
+
+		if (blk < minblk || blk > maxblk)
+			return false;

+		res = (ItemPointer) bsearch((void *) itemptr,
 								(void *) dead_tuples->itemptrs,
 								dead_tuples->num_tuples,
 								sizeof(ItemPointerData),
 								vac_cmp_itemptr);
+		return (res != NULL);
-	return (res != NULL);
+	}
+	else
+	{
+		ItemPointer res;
+
+		res = (ItemPointer) bsearch((void *) itemptr,
+									(void *) dead_tuples->itemptrs,
+									dead_tuples->num_tuples,
+									sizeof(ItemPointerData),
+									vac_cmp_itemptr);
+		return (res != NULL);
+	}
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index a62d64eaa4..8b3036a32a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -482,6 +482,13 @@ const struct config_enum_entry ssl_protocol_versions_info[] = {
 	{NULL, 0, false}
 };
 
+const struct config_enum_entry xx_vacuum_options[] = {
+	{"array", VACUUM_ARRAY, false},
+	{"intset", VACUUM_INTSET, false},
+	{"array-minmax", VACUUM_ARRAY_MINMAX, false},
+	{NULL, 0, false}
+};
+
 StaticAssertDecl(lengthof(ssl_protocol_versions_info) == (PG_TLS1_3_VERSION + 2),
 				 "array length mismatch");
 
@@ -4457,6 +4464,15 @@ static struct config_string ConfigureNamesString[] =
 
 static struct config_enum ConfigureNamesEnum[] =
 {
+	{
+		{"xx_vacuum", PGC_USERSET, RESOURCES,
+			gettext_noop("vacuum strategy."),
+		},
+		&xx_vacuum,
+		VACUUM_ARRAY, xx_vacuum_options,
+		NULL, NULL, NULL
+	},
+
 	{
 		{"backslash_quote", PGC_USERSET, COMPAT_OPTIONS_PREVIOUS,
 			gettext_noop("Sets whether \"\\'\" is allowed in string literals."),
diff --git a/src/include/commands/vacuum.h b/src/include/commands/vacuum.h
index d9475c9989..ca50589a2b 100644
--- a/src/include/commands/vacuum.h
+++ b/src/include/commands/vacuum.h
@@ -23,6 +23,13 @@
 #include "storage/lock.h"
 #include "utils/relcache.h"
 
+enum xx_vacuum_options
+{
+	VACUUM_ARRAY = 0,
+	VACUUM_INTSET,
+	VACUUM_ARRAY_MINMAX
+};
+
 /*
  * Flags for amparallelvacuumoptions to control the participation of bulkdelete
  * and vacuumcleanup in parallel vacuum.
@@ -243,6 +250,7 @@ extern pg_atomic_uint32 *VacuumSharedCostBalance;
 extern pg_atomic_uint32 *VacuumActiveNWorkers;
 
 extern int	VacuumCostBalanceLocal;
+extern int	xx_vacuum;
 
 /* in commands/vacuum.c */
 extern void ExecVacuum(ParseState *pstate, VacuumStmt *vacstmt, bool isTopLevel);