On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 6:23 PM John Naylor
> <john.nay...@enterprisedb.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.m...@gmail.com>
> > wrote:
> > >
> > > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> >
> > > > > I can think that something like traversing a HOT chain could visit
> > > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > > before heap vacuum in heap case.
> > > >
> > > > Further, currently we *already* assume we populate the tid array in
> > > > order (for binary search), so we can just continue assuming that (with
> > > > an assert added since it's more public in this form). I'm not sure why
> > > > such basic common sense evaded me a few versions ago...
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> > That's what I was probably thinking about some weeks ago, but I'm having a
> > hard time imagining how it would come up, even for something like the
> > conveyor-belt concept.
> >
> > > We have the following WIP comment in test_radixtree:
> > >
> > > // WIP: compiles with warnings because rt_attach is defined but not used
> > > // #define RT_SHMEM
> > >
> > > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > > and friends?
> >
> > Sounds good to me, and the other fixes make sense as well.
>
> Thanks, I merged them.
>
> > > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > > seems to work fine.
> > That was on my list, so great! How about the other end -- nominally we
> > allow 512b. (In practice it won't matter, but this would make sure I didn't
> > mess anything up when forcing all MaxTuplesPerPage to encode.)
>
> According to the doc, the minimum block size is 1kB. It seems to work
> fine with 1kB blocks.
> > >
> > > You removed the vacuum integration patch from v27, is there any reason
> > > for that?
> >
> > Just an oversight.
> >
> > Now for some general comments on the tid store...
> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call tidstore_detach to free up backend-local memory
> > associated
> > + * with the TidStore. The backend that calls tidstore_destroy must not call
> > + * tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > Do we need to do anything for this todo?
>
> Since it's practically no problem, I think we can live with it for
> now. dshash also has the same todo.
>
> > It might help readability to have a concept of "off_upper/off_lower", just
> > so we can describe things more clearly. The key is block + off_upper, and
> > the value is a bitmap of all the off_lower bits. I hinted at that in my
> > addition of encode_key_off(). Along those lines, maybe
> > s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even
> > sure the TIDSTORE_ prefix is valuable for these local macros.
> >
> > The word "value" as a variable name is pretty generic in this context, and
> > it might be better to call it the off_lower_bitmap, at least in some
> > places. The "key" doesn't have a good short term for naming, but in
> > comments we should make sure we're clear it's "block# + off_upper".
> >
> > I'm not a fan of the name "tid_i", even as a temp variable -- maybe
> > "compressed_tid"?
> >
> > maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
> >
> > It might be worth using typedefs for key and value type. Actually, since
> > key type is fixed for the foreseeable future, maybe the radix tree template
> > should define a key typedef?
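To make the off_upper/off_lower terminology above concrete, here is a minimal sketch of the encoding being discussed. The bit widths, macro names, and function names here are illustrative assumptions for a toy setup, not the patch's actual definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed widths: a 64-bit bitmap covers 2^6 "lower" offsets, and 3
 * "upper" offset bits cover offsets up to 512 (both hypothetical). */
#define LOWER_OFFSET_NBITS	6
#define LOWER_OFFSET_MASK	((1 << LOWER_OFFSET_NBITS) - 1)
#define UPPER_OFFSET_NBITS	3

typedef uint64_t tidkey;		/* radix tree key: block# + off_upper */
typedef uint64_t offsetbm;		/* radix tree value: bitmap of off_lower bits */

/* Encode (block, offset): the key gets the block number plus the upper
 * offset bits; *off_bit is the single bitmap bit for the lower bits. */
static tidkey
encode_blk_off(uint32_t block, uint16_t offset, offsetbm *off_bit)
{
	*off_bit = (offsetbm) 1 << (offset & LOWER_OFFSET_MASK);
	return ((tidkey) block << UPPER_OFFSET_NBITS) |
		(offset >> LOWER_OFFSET_NBITS);
}

/* Recover the block number from a key */
static uint32_t
key_get_blkno(tidkey key)
{
	return (uint32_t) (key >> UPPER_OFFSET_NBITS);
}
```

All offsets on one block that share the same upper bits land on the same key, so up to 64 TIDs collapse into a single key/bitmap pair.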
> >
> > The term "result" is probably fine within the tidstore, but as a public
> > name used by vacuum, it's not very descriptive. I don't have a good idea,
> > though.
> >
> > Some files in backend/access use CamelCase for public functions, although
> > it's not consistent. I think doing that for tidstore would help
> > readability, since they would stand out from rt_* functions and vacuum
> > functions. It's a matter of taste, though.
> >
> > I don't understand the control flow in tidstore_iterate_next(), or when
> > BlockNumberIsValid() is true. If this is the best way to code this, it
> > needs more commentary.
>
> The attached 0008 patch addressed all above comments on tidstore.
>
> > Some comments on vacuum:
> >
> > I think we'd better get some real-world testing of this, fairly soon.
> >
> > I had an idea: If it's not too much effort, it might be worth splitting it
> > into two parts: one that just adds the store (not caring about its memory
> > limits or progress reporting etc). During index scan, check both the new
> > store and the array and log a warning (we don't want to exit or crash,
> > better to try to investigate while live if possible) if the result doesn't
> > match. Then perhaps set up an instance and let something like TPC-C run for
> > a few days. The second patch would just restore the rest of the current
> > patch. That would help reassure us it's working as designed.
>
> Yeah, I did a similar thing in an earlier version of tidstore patch.
> Since we're trying to introduce two new components: radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug either in tidstore or radix
> tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids to both the radix tree and array, and checks if the
> results match on lookup or iteration. It will use more memory but it
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.
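The cross-checking idea described above -- keep a plain array alongside the primary structure in debug builds and assert that both agree -- can be sketched generically. This is a toy miniature, not the patch's code: IntSet, STORE_DEBUG, and the function names are hypothetical stand-ins (the bitmap plays the role of the radix tree, the shadow array the role of the tid array):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STORE_DEBUG				/* stand-in for the proposed TIDSTORE_DEBUG */

/* A toy set of small integers: a bitmap as the "real" structure, plus,
 * in debug builds only, a shadow array used purely for verification. */
typedef struct IntSet
{
	uint64_t	bitmap[4];		/* primary structure (cf. the radix tree) */
#ifdef STORE_DEBUG
	int			dbg_items[256];	/* shadow copy (cf. the tid array) */
	int			dbg_nitems;
#endif
} IntSet;

static void
intset_add(IntSet *s, int v)
{
	s->bitmap[v / 64] |= UINT64_C(1) << (v % 64);
#ifdef STORE_DEBUG
	s->dbg_items[s->dbg_nitems++] = v;	/* mirror every insertion */
#endif
}

static bool
intset_is_member(IntSet *s, int v)
{
	bool		found = (s->bitmap[v / 64] >> (v % 64)) & 1;

#ifdef STORE_DEBUG
	bool		dbg_found = false;

	for (int i = 0; i < s->dbg_nitems; i++)
		if (s->dbg_items[i] == v)
			dbg_found = true;
	/* the two representations must always agree */
	assert(found == dbg_found);
#endif
	return found;
}
```

The shadow copy only exists when the debug macro is defined, which mirrors the argument that the extra memory is acceptable in assertion-enabled builds.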
I've tried this idea. Enabling this check on all debug builds (i.e.,
whenever the USE_ASSERT_CHECKING macro is defined) did not seem like a
good idea, so I gated it behind a dedicated macro, TIDSTORE_DEBUG. I
think we can define this macro on some bf animals (or possibly a new
bf animal).

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 107aa2af2966c10ce750e6b410ae570462423aab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Wed, 22 Feb 2023 14:43:15 +0900
Subject: [PATCH v29 11/11] Debug TIDStore.

---
 src/backend/access/common/tidstore.c | 242 ++++++++++++++++++++++++++-
 1 file changed, 238 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 9360520482..438bf0c800 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -28,12 +28,20 @@
 #include "postgres.h"
 
 #include "access/tidstore.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
 #include "utils/memutils.h"
 
+#define TIDSTORE_DEBUG
+
+/* Enable TidStore debugging only when USE_ASSERT_CHECKING */
+#if defined(TIDSTORE_DEBUG) && !defined(USE_ASSERT_CHECKING)
+#undef TIDSTORE_DEBUG
+#endif
+
 /*
  * For encoding purposes, a tid is represented as a pair of 64-bit key and
  * 64-bit value.
@@ -115,6 +123,12 @@ typedef struct TidStoreControl
 	/* handles for TidStore and radix tree */
 	TidStoreHandle handle;
 	shared_rt_handle tree_handle;
+
+#ifdef TIDSTORE_DEBUG
+	dsm_handle	tids_handle;
+	int64		max_tids;
+	bool		tids_unordered;
+#endif
 } TidStoreControl;
 
 /* Per-backend state for a TidStore */
@@ -135,6 +149,11 @@ struct TidStore
 
 	/* DSA area for TidStore if used */
 	dsa_area   *area;
+
+#ifdef TIDSTORE_DEBUG
+	dsm_segment *tids_seg;
+	ItemPointerData *tids;
+#endif
 };
 
 #define TidStoreIsShared(ts) ((ts)->area != NULL)
@@ -157,6 +176,11 @@ typedef struct TidStoreIter
 	tidkey		next_tidkey;
 	offsetbm	next_off_bitmap;
 
+#ifdef TIDSTORE_DEBUG
+	/* iterator index for the ts->tids array */
+	int64		tids_idx;
+#endif
+
 	/*
 	 * output for the caller. Must be last because variable-size.
 	 */
@@ -169,6 +193,15 @@ static inline tidkey encode_blk_off(TidStore *ts, BlockNumber block,
 									OffsetNumber offset, offsetbm *off_bit);
 static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm *off_bit);
 
+/* debug functions available only when TIDSTORE_DEBUG */
+#ifdef TIDSTORE_DEBUG
+static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+									   int num_offsets);
+static void ts_debug_iter_check_tids(TidStoreIter *iter);
+static bool ts_debug_is_member(TidStore *ts, ItemPointer tid);
+static int	itemptr_cmp(const void *left, const void *right);
+#endif
+
 /*
  * Create a TidStore. The returned object is allocated in backend-local memory.
  * The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
@@ -237,6 +270,26 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area *area)
 	ts->control->upper_off_nbits =
 		ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
 
+#ifdef TIDSTORE_DEBUG
+	{
+		int64		max_tids = max_bytes / sizeof(ItemPointerData);
+
+		/* Allocate the array of tids too */
+		if (TidStoreIsShared(ts))
+		{
+			ts->tids_seg = dsm_create(sizeof(ItemPointerData) * max_tids, 0);
+			ts->tids = dsm_segment_address(ts->tids_seg);
+			ts->control->tids_handle = dsm_segment_handle(ts->tids_seg);
+			ts->control->max_tids = max_tids;
+		}
+		else
+		{
+			ts->tids = palloc(sizeof(ItemPointerData) * max_tids);
+			ts->control->max_tids = max_tids;
+		}
+	}
+#endif
+
 	return ts;
 }
 
@@ -266,6 +319,11 @@ TidStoreAttach(dsa_area *area, TidStoreHandle handle)
 	ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
 	ts->area = area;
 
+#ifdef TIDSTORE_DEBUG
+	ts->tids_seg = dsm_attach(ts->control->tids_handle);
+	ts->tids = (ItemPointer) dsm_segment_address(ts->tids_seg);
+#endif
+
 	return ts;
 }
 
@@ -280,6 +338,11 @@ TidStoreDetach(TidStore *ts)
 {
 	Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
 
+#ifdef TIDSTORE_DEBUG
+	if (TidStoreIsShared(ts))
+		dsm_detach(ts->tids_seg);
+#endif
+
 	shared_rt_detach(ts->tree.shared);
 	pfree(ts);
 }
@@ -315,6 +378,13 @@ TidStoreDestroy(TidStore *ts)
 		local_rt_free(ts->tree.local);
 	}
 
+#ifdef TIDSTORE_DEBUG
+	if (TidStoreIsShared(ts))
+		dsm_detach(ts->tids_seg);
+	else
+		pfree(ts->tids);
+#endif
+
 	pfree(ts);
 }
 
@@ -434,6 +504,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 		}
 	}
 
+#ifdef TIDSTORE_DEBUG
+	/* Insert tids into the tid array too */
+	ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets);
+#endif
+
 	/* update statistics */
 	ts->control->num_tids += num_offsets;
 
@@ -451,6 +526,11 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 	offsetbm	off_bitmap = 0;
 	offsetbm	off_bit;
 	bool		found;
+	bool		ret;
+
+#ifdef TIDSTORE_DEBUG
+	bool		ret_debug = ts_debug_is_member(ts, tid);
+#endif
 
 	key = encode_tid(ts, tid, &off_bit);
 
@@ -460,9 +540,20 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
 		found = local_rt_search(ts->tree.local, key, &off_bitmap);
 
 	if (!found)
+	{
+#ifdef TIDSTORE_DEBUG
+		Assert(!ret_debug);
+#endif
 		return false;
+	}
+
+	ret = (off_bitmap & off_bit) != 0;
 
-	return (off_bitmap & off_bit) != 0;
+#ifdef TIDSTORE_DEBUG
+	Assert(ret == ret_debug);
+#endif
+
+	return ret;
 }
 
 /*
@@ -494,6 +585,10 @@ TidStoreBeginIterate(TidStore *ts)
 	if (TidStoreNumTids(ts) == 0)
 		iter->finished = true;
 
+#ifdef TIDSTORE_DEBUG
+	iter->tids_idx = 0;
+#endif
+
 	return iter;
 }
 
@@ -515,6 +610,7 @@ TidStoreIterResult *
 TidStoreIterateNext(TidStoreIter *iter)
 {
 	tidkey		key;
+	bool		iter_found;
 	offsetbm	off_bitmap = 0;
 	TidStoreIterResult *output = &(iter->output);
 
@@ -532,7 +628,7 @@ TidStoreIterateNext(TidStoreIter *iter)
 	if (iter->next_off_bitmap > 0)
 		iter_decode_key_off(iter, iter->next_tidkey, iter->next_off_bitmap);
 
-	while (tidstore_iter(iter, &key, &off_bitmap))
+	while ((iter_found = tidstore_iter(iter, &key, &off_bitmap)))
 	{
 		BlockNumber blkno = key_get_blkno(iter->ts, key);
 
@@ -545,14 +641,20 @@
 			 */
 			iter->next_tidkey = key;
 			iter->next_off_bitmap = off_bitmap;
-			return output;
+			break;
 		}
 
 		/* Collect tids decoded from the key and offset bitmap */
 		iter_decode_key_off(iter, key, off_bitmap);
 	}
 
-	iter->finished = true;
+	if (!iter_found)
+		iter->finished = true;
+
+#ifdef TIDSTORE_DEBUG
+	ts_debug_iter_check_tids(iter);
+#endif
+
 	return output;
 }
 
@@ -699,3 +801,135 @@ encode_blk_off(TidStore *ts, BlockNumber block, OffsetNumber offset,
 
 	return key;
 }
+
+#ifdef TIDSTORE_DEBUG
+/* Comparator routines for ItemPointer */
+static int
+itemptr_cmp(const void *left, const void *right)
+{
+	BlockNumber lblk,
+				rblk;
+	OffsetNumber loff,
+				roff;
+
+	lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+	rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+	if (lblk < rblk)
+		return -1;
+	if (lblk > rblk)
+		return 1;
+
+	loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+	roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+	if (loff < roff)
+		return -1;
+	if (loff > roff)
+		return 1;
+
+	return 0;
+}
+
+/* Insert tids to the tid array for debugging */
+static void
+ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
+						   int num_offsets)
+{
+	if (ts->control->num_tids > 0 &&
+		blkno < ItemPointerGetBlockNumber(&(ts->tids[ts->control->num_tids - 1])))
+	{
+		/* The array will be sorted at ts_debug_is_member() */
+		ts->control->tids_unordered = true;
+	}
+
+	for (int i = 0; i < num_offsets; i++)
+	{
+		ItemPointer tid;
+		int			idx = ts->control->num_tids + i;
+
+		/* Enlarge the tid array if necessary */
+		if (idx >= ts->control->max_tids)
+		{
+			ts->control->max_tids *= 2;
+
+			if (TidStoreIsShared(ts))
+			{
+				dsm_segment *new_seg =
+					dsm_create(sizeof(ItemPointerData) * ts->control->max_tids, 0);
+				ItemPointer new_tids = dsm_segment_address(new_seg);
+
+				/* copy tids from old to new array */
+				memcpy(new_tids, ts->tids,
+					   sizeof(ItemPointerData) * (ts->control->max_tids / 2));
+
+				dsm_detach(ts->tids_seg);
+				ts->tids = new_tids;
+			}
+			else
+				ts->tids = repalloc(ts->tids,
+									sizeof(ItemPointerData) * ts->control->max_tids);
+		}
+
+		tid = &(ts->tids[idx]);
+		ItemPointerSetBlockNumber(tid, blkno);
+		ItemPointerSetOffsetNumber(tid, offsets[i]);
+	}
+}
+
+/* Return true if the given tid is present in the tid array */
+static bool
+ts_debug_is_member(TidStore *ts, ItemPointer tid)
+{
+	int64		litem,
+				ritem,
+				item;
+	ItemPointer res;
+
+	if (ts->control->num_tids == 0)
+		return false;
+
+	/* Make sure the tid array is sorted */
+	if (ts->control->tids_unordered)
+	{
+		qsort(ts->tids, ts->control->num_tids, sizeof(ItemPointerData), itemptr_cmp);
+		ts->control->tids_unordered = false;
+	}
+
+	litem = itemptr_encode(&ts->tids[0]);
+	ritem = itemptr_encode(&ts->tids[ts->control->num_tids - 1]);
+	item = itemptr_encode(tid);
+
+	/*
+	 * Doing a simple bound check before bsearch() is useful to avoid the
+	 * extra cost of bsearch(), especially if dead items on the heap are
+	 * concentrated in a certain range. Since this function is called for
+	 * every index tuple, it pays to be really fast.
+	 */
+	if (item < litem || item > ritem)
+		return false;
+
+	res = bsearch(tid, ts->tids, ts->control->num_tids, sizeof(ItemPointerData),
+				  itemptr_cmp);
+
+	return (res != NULL);
+}
+
+/* Verify if the iterator output matches the tids in the array for debugging */
+static void
+ts_debug_iter_check_tids(TidStoreIter *iter)
+{
+	BlockNumber blkno = iter->output.blkno;
+
+	for (int i = 0; i < iter->output.num_offsets; i++)
+	{
+		ItemPointer tid = &(iter->ts->tids[iter->tids_idx + i]);
+
+		Assert((iter->tids_idx + i) < iter->ts->control->max_tids);
+		Assert(ItemPointerGetBlockNumber(tid) == blkno);
+		Assert(ItemPointerGetOffsetNumber(tid) == iter->output.offsets[i]);
+	}
+
+	iter->tids_idx += iter->output.num_offsets;
+}
+#endif
--
2.31.1
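As a postscript, the bound check that ts_debug_is_member() performs before bsearch() is a trick worth showing standalone. In this sketch, the Tid struct and tid_encode() are simplified stand-ins for ItemPointerData and PostgreSQL's itemptr_encode(); only the shape of the check is taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-in for ItemPointerData */
typedef struct Tid
{
	uint32_t	block;
	uint16_t	offset;
} Tid;

/* Order-preserving encoding of a TID into one integer, in the spirit of
 * itemptr_encode(): comparing encodings compares (block, offset) pairs. */
static int64_t
tid_encode(const Tid *t)
{
	return ((int64_t) t->block << 16) | t->offset;
}

static int
tid_cmp(const void *left, const void *right)
{
	int64_t		l = tid_encode((const Tid *) left);
	int64_t		r = tid_encode((const Tid *) right);

	return (l > r) - (l < r);
}

/* Membership test over a sorted array: reject probes outside the
 * [first, last] range before paying for bsearch(), which is cheap and
 * effective when dead TIDs cluster in a narrow block range. */
static bool
tid_is_member(const Tid *tids, size_t n, const Tid *probe)
{
	int64_t		item;

	if (n == 0)
		return false;

	item = tid_encode(probe);
	if (item < tid_encode(&tids[0]) || item > tid_encode(&tids[n - 1]))
		return false;

	return bsearch(probe, tids, n, sizeof(Tid), tid_cmp) != NULL;
}
```

Since the debug check runs once per index tuple during vacuum, skipping bsearch() for out-of-range probes is the difference between two comparisons and O(log n) of them.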