On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 6:23 PM John Naylor
> <john.nay...@enterprisedb.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.m...@gmail.com> 
> > wrote:
> > >
> > > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > > <john.nay...@enterprisedb.com> wrote:
> >
> > > > > I can think that something like traversing a HOT chain could visit
> > > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > > before heap vacuum in heap case.
> > > >
> > > > Further, currently we *already* assume we populate the tid array in 
> > > > order (for binary search), so we can just continue assuming that (with 
> > > > an assert added since it's more public in this form). I'm not sure why 
> > > > such basic common sense evaded me a few versions ago...
> > >
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a 
> > hard time imagining how it would come up, even for something like the 
> > conveyor-belt concept.
> >
> > > We have the following WIP comment in test_radixtree:
> > >
> > > // WIP: compiles with warnings because rt_attach is defined but not used
> > > // #define RT_SHMEM
> > >
> > > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > > and friends?
> >
> > Sounds good to me, and the other fixes make sense as well.
>
> Thanks, I merged them.
>
> >
> > > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > > seems to work fine.
> >
> > That was on my list, so great! How about the other end -- nominally we 
> > allow 512b. (In practice it won't matter, but this would make sure I didn't 
> > mess anything up when forcing all MaxTuplesPerPage to encode.)
>
> According to the doc, the minimum block size is 1kB. It seems to work
> fine with 1kB blocks.
>
> >
> > > You removed the vacuum integration patch from v27, is there any reason 
> > > for that?
> >
> > Just an oversight.
> >
> > Now for some general comments on the tid store...
> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call tidstore_detach to free up backend-local memory 
> > associated
> > + * with the TidStore. The backend that calls tidstore_destroy must not call
> > + * tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > Do we need to do anything for this todo?
>
> Since it's practically no problem, I think we can live with it for
> now. dshash also has the same todo.
>
> >
> > It might help readability to have a concept of "off_upper/off_lower", just 
> > so we can describe things more clearly. The key is block + off_upper, and 
> > the value is a bitmap of all the off_lower bits. I hinted at that in my 
> > addition of encode_key_off(). Along those lines, maybe 
> > s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even 
> > sure the TIDSTORE_ prefix is valuable for these local macros.
> >
> > The word "value" as a variable name is pretty generic in this context, and 
> > it might be better to call it the off_lower_bitmap, at least in some 
> > places. The "key" doesn't have a good short term for naming, but in 
> > comments we should make sure we're clear it's "block# + off_upper".
> >
> > I'm not a fan of the name "tid_i", even as a temp variable -- maybe 
> > "compressed_tid"?
> >
> > maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
> >
> > It might be worth using typedefs for key and value type. Actually, since 
> > key type is fixed for the foreseeable future, maybe the radix tree template 
> > should define a key typedef?
> >
> > The term "result" is probably fine within the tidstore, but as a public 
> > name used by vacuum, it's not very descriptive. I don't have a good idea, 
> > though.
> >
> > Some files in backend/access use CamelCase for public functions, although 
> > it's not consistent. I think doing that for tidstore would help 
> > readability, since they would stand out from rt_* functions and vacuum 
> > functions. It's a matter of taste, though.
> >
> > I don't understand the control flow in tidstore_iterate_next(), or when 
> > BlockNumberIsValid() is true. If this is the best way to code this, it 
> > needs more commentary.
>
> The attached 0008 patch addressed all above comments on tidstore.
>
> > Some comments on vacuum:
> >
> > I think we'd better get some real-world testing of this, fairly soon.
> >
> > I had an idea: If it's not too much effort, it might be worth splitting it 
> > into two parts: one that just adds the store (not caring about its memory 
> > limits or progress reporting etc). During index scan, check both the new 
> > store and the array and log a warning (we don't want to exit or crash, 
> > better to try to investigate while live if possible) if the result doesn't 
> > match. Then perhaps set up an instance and let something like TPC-C run for 
> > a few days. The second patch would just restore the rest of the current 
> > patch. That would help reassure us it's working as designed.
>
> Yeah, I did a similar thing in an earlier version of tidstore patch.
> Since we're trying to introduce two new components: radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug either in tidstore or radix
> tree. If there is a bug in lazy vacuum, we cannot even do initdb. So
> it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids to both the radix tree and array, and checks if the
> results match when lookup or iteration. It will use more memory but it
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.

I've tried this idea. Enabling this check on all debug builds (i.e.,
with USE_ASSERT_CHECKING macro) seems not a good idea so I use a
special macro for that, TIDSTORE_DEBUG. I think we can define this
macro on some bf animals (or possibly a new bf animal).

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From 107aa2af2966c10ce750e6b410ae570462423aab Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <sawada.m...@gmail.com>
Date: Wed, 22 Feb 2023 14:43:15 +0900
Subject: [PATCH v29 11/11] Debug TIDStore.

---
 src/backend/access/common/tidstore.c | 242 ++++++++++++++++++++++++++-
 1 file changed, 238 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/common/tidstore.c 
b/src/backend/access/common/tidstore.c
index 9360520482..438bf0c800 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -28,12 +28,20 @@
 #include "postgres.h"
 
 #include "access/tidstore.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "port/pg_bitutils.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
 #include "utils/memutils.h"
 
+#define TIDSTORE_DEBUG
+
+/* Enable TidStore debugging only when USE_ASSERT_CHECKING */
+#if defined(TIDSTORE_DEBUG) && !defined(USE_ASSERT_CHECKING)
+#undef TIDSTORE_DEBUG
+#endif
+
 /*
  * For encoding purposes, a tid is represented as a pair of 64-bit key and
  * 64-bit value.
@@ -115,6 +123,12 @@ typedef struct TidStoreControl
        /* handles for TidStore and radix tree */
        TidStoreHandle          handle;
        shared_rt_handle        tree_handle;
+
+#ifdef TIDSTORE_DEBUG
+       dsm_handle              tids_handle;
+       int64                   max_tids;
+       bool                    tids_unordered;
+#endif
 } TidStoreControl;
 
 /* Per-backend state for a TidStore */
@@ -135,6 +149,11 @@ struct TidStore
 
        /* DSA area for TidStore if used */
        dsa_area        *area;
+
+#ifdef TIDSTORE_DEBUG
+       dsm_segment             *tids_seg;
+       ItemPointerData *tids;
+#endif
 };
 #define TidStoreIsShared(ts) ((ts)->area != NULL)
 
@@ -157,6 +176,11 @@ typedef struct TidStoreIter
        tidkey          next_tidkey;
        offsetbm        next_off_bitmap;
 
+#ifdef TIDSTORE_DEBUG
+       /* iterator index for the ts->tids array */
+       int64           tids_idx;
+#endif
+
        /*
         * output for the caller. Must be last because variable-size.
         */
@@ -169,6 +193,15 @@ static inline tidkey encode_blk_off(TidStore *ts, 
BlockNumber block,
                                                                        
OffsetNumber offset, offsetbm *off_bit);
 static inline tidkey encode_tid(TidStore *ts, ItemPointer tid, offsetbm 
*off_bit);
 
+/* debug functions available only when TIDSTORE_DEBUG */
+#ifdef TIDSTORE_DEBUG
+static void ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, 
OffsetNumber *offsets,
+                                                                          int 
num_offsets);
+static void ts_debug_iter_check_tids(TidStoreIter *iter);
+static bool ts_debug_is_member(TidStore *ts, ItemPointer tid);
+static int itemptr_cmp(const void *left, const void *right);
+#endif
+
 /*
  * Create a TidStore. The returned object is allocated in backend-local memory.
  * The radix tree for storage is allocated in DSA area is 'area' is non-NULL.
@@ -237,6 +270,26 @@ TidStoreCreate(size_t max_bytes, int max_off, dsa_area 
*area)
        ts->control->upper_off_nbits =
                ts->control->max_off_nbits - LOWER_OFFSET_NBITS;
 
+#ifdef TIDSTORE_DEBUG
+       {
+               int64           max_tids = max_bytes / sizeof(ItemPointerData);
+
+               /* Allocate the array of tids too */
+               if (TidStoreIsShared(ts))
+               {
+                       ts->tids_seg = dsm_create(sizeof(ItemPointerData) * 
max_tids, 0);
+                       ts->tids = dsm_segment_address(ts->tids_seg);
+                       ts->control->tids_handle = 
dsm_segment_handle(ts->tids_seg);
+                       ts->control->max_tids = max_tids;
+               }
+               else
+               {
+                       ts->tids = palloc(sizeof(ItemPointerData) * max_tids);
+                       ts->control->max_tids = max_tids;
+               }
+       }
+#endif
+
        return ts;
 }
 
@@ -266,6 +319,11 @@ TidStoreAttach(dsa_area *area, TidStoreHandle handle)
        ts->tree.shared = shared_rt_attach(area, ts->control->tree_handle);
        ts->area = area;
 
+#ifdef TIDSTORE_DEBUG
+       ts->tids_seg = dsm_attach(ts->control->tids_handle);
+       ts->tids = (ItemPointer) dsm_segment_address(ts->tids_seg);
+#endif
+
        return ts;
 }
 
@@ -280,6 +338,11 @@ TidStoreDetach(TidStore *ts)
 {
        Assert(TidStoreIsShared(ts) && ts->control->magic == TIDSTORE_MAGIC);
 
+#ifdef TIDSTORE_DEBUG
+       if (TidStoreIsShared(ts))
+               dsm_detach(ts->tids_seg);
+#endif
+
        shared_rt_detach(ts->tree.shared);
        pfree(ts);
 }
@@ -315,6 +378,13 @@ TidStoreDestroy(TidStore *ts)
                local_rt_free(ts->tree.local);
        }
 
+#ifdef TIDSTORE_DEBUG
+       if (TidStoreIsShared(ts))
+               dsm_detach(ts->tids_seg);
+       else
+                pfree(ts->tids);
+#endif
+
        pfree(ts);
 }
 
@@ -434,6 +504,11 @@ TidStoreSetBlockOffsets(TidStore *ts, BlockNumber blkno, 
OffsetNumber *offsets,
                }
        }
 
+#ifdef TIDSTORE_DEBUG
+       /* Insert tids into the tid array too */
+       ts_debug_set_block_offsets(ts, blkno, offsets, num_offsets);
+#endif
+
        /* update statistics */
        ts->control->num_tids += num_offsets;
 
@@ -451,6 +526,11 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
        offsetbm off_bitmap = 0;
        offsetbm off_bit;
        bool found;
+       bool ret;
+
+#ifdef TIDSTORE_DEBUG
+       bool ret_debug = ts_debug_is_member(ts, tid);
+#endif
 
        key = encode_tid(ts, tid, &off_bit);
 
@@ -460,9 +540,20 @@ TidStoreIsMember(TidStore *ts, ItemPointer tid)
                found = local_rt_search(ts->tree.local, key, &off_bitmap);
 
        if (!found)
+       {
+#ifdef TIDSTORE_DEBUG
+               Assert(!ret_debug);
+#endif
                return false;
+       }
+
+       ret = (off_bitmap & off_bit) != 0;
 
-       return (off_bitmap & off_bit) != 0;
+#ifdef TIDSTORE_DEBUG
+       Assert(ret == ret_debug);
+#endif
+
+       return ret;
 }
 
 /*
@@ -494,6 +585,10 @@ TidStoreBeginIterate(TidStore *ts)
        if (TidStoreNumTids(ts) == 0)
                iter->finished = true;
 
+#ifdef TIDSTORE_DEBUG
+       iter->tids_idx = 0;
+#endif
+
        return iter;
 }
 
@@ -515,6 +610,7 @@ TidStoreIterResult *
 TidStoreIterateNext(TidStoreIter *iter)
 {
        tidkey key;
+       bool    iter_found;
        offsetbm off_bitmap = 0;
        TidStoreIterResult *output = &(iter->output);
 
@@ -532,7 +628,7 @@ TidStoreIterateNext(TidStoreIter *iter)
        if (iter->next_off_bitmap > 0)
                iter_decode_key_off(iter, iter->next_tidkey, 
iter->next_off_bitmap);
 
-       while (tidstore_iter(iter, &key, &off_bitmap))
+       while ((iter_found = tidstore_iter(iter, &key, &off_bitmap)))
        {
                BlockNumber blkno = key_get_blkno(iter->ts, key);
 
@@ -545,14 +641,20 @@ TidStoreIterateNext(TidStoreIter *iter)
                         */
                        iter->next_tidkey = key;
                        iter->next_off_bitmap = off_bitmap;
-                       return output;
+                       break;
                }
 
                /* Collect tids decoded from the key and offset bitmap */
                iter_decode_key_off(iter, key, off_bitmap);
        }
 
-       iter->finished = true;
+       if (!iter_found)
+               iter->finished = true;
+
+#ifdef TIDSTORE_DEBUG
+       ts_debug_iter_check_tids(iter);
+#endif
+
        return output;
 }
 
@@ -699,3 +801,135 @@ encode_blk_off(TidStore *ts, BlockNumber block, 
OffsetNumber offset,
 
        return key;
 }
+
+#ifdef TIDSTORE_DEBUG
+/* Comparator routines for ItemPointer */
+static int
+itemptr_cmp(const void *left, const void *right)
+{
+       BlockNumber lblk,
+               rblk;
+       OffsetNumber loff,
+               roff;
+
+       lblk = ItemPointerGetBlockNumber((ItemPointer) left);
+       rblk = ItemPointerGetBlockNumber((ItemPointer) right);
+
+       if (lblk < rblk)
+               return -1;
+       if (lblk > rblk)
+               return 1;
+
+       loff = ItemPointerGetOffsetNumber((ItemPointer) left);
+       roff = ItemPointerGetOffsetNumber((ItemPointer) right);
+
+       if (loff < roff)
+               return -1;
+       if (loff > roff)
+               return 1;
+
+       return 0;
+}
+
+/* Insert tids to the tid array for debugging */
+static void
+ts_debug_set_block_offsets(TidStore *ts, BlockNumber blkno, OffsetNumber 
*offsets,
+                                                  int num_offsets)
+{
+       if (ts->control->num_tids > 0 &&
+               blkno < 
ItemPointerGetBlockNumber(&(ts->tids[ts->control->num_tids - 1])))
+       {
+               /* The array will be sorted at ts_debug_is_member() */
+               ts->control->tids_unordered = true;
+       }
+
+       for (int i = 0; i < num_offsets; i++)
+       {
+               ItemPointer tid;
+               int idx = ts->control->num_tids + i;
+
+               /* Enlarge the tid array if necessary */
+               if (idx >= ts->control->max_tids)
+               {
+                       ts->control->max_tids *= 2;
+
+                       if (TidStoreIsShared(ts))
+                       {
+                               dsm_segment *new_seg =
+                                       dsm_create(sizeof(ItemPointerData) * 
ts->control->max_tids, 0);
+                               ItemPointer new_tids = 
dsm_segment_address(new_seg);
+
+                               /* copy tids from old to new array */
+                               memcpy(new_tids, ts->tids,
+                                          sizeof(ItemPointerData) * 
(ts->control->max_tids / 2));
+
+                               dsm_detach(ts->tids_seg);
+                               ts->tids = new_tids;
+                       }
+                       else
+                               ts->tids = repalloc(ts->tids,
+                                                                       
sizeof(ItemPointerData) * ts->control->max_tids);
+               }
+
+               tid = &(ts->tids[idx]);
+               ItemPointerSetBlockNumber(tid, blkno);
+               ItemPointerSetOffsetNumber(tid, offsets[i]);
+       }
+}
+
+/* Return true if the given tid is present in the tid array */
+static bool
+ts_debug_is_member(TidStore *ts, ItemPointer tid)
+{
+       int64   litem,
+               ritem,
+               item;
+       ItemPointer res;
+
+       if (ts->control->num_tids == 0)
+               return false;
+
+       /* Make sure the tid array is sorted */
+       if (ts->control->tids_unordered)
+       {
+               qsort(ts->tids, ts->control->num_tids, sizeof(ItemPointerData), 
itemptr_cmp);
+               ts->control->tids_unordered = false;
+       }
+
+       litem = itemptr_encode(&ts->tids[0]);
+       ritem = itemptr_encode(&ts->tids[ts->control->num_tids - 1]);
+       item = itemptr_encode(tid);
+
+       /*
+        * Doing a simple bound check before bsearch() is useful to avoid the
+        * extra cost of bsearch(), especially if dead items on the heap are
+        * concentrated in a certain range.     Since this function is called 
for
+        * every index tuple, it pays to be really fast.
+        */
+       if (item < litem || item > ritem)
+               return false;
+
+       res = bsearch(tid, ts->tids, ts->control->num_tids, 
sizeof(ItemPointerData),
+                                 itemptr_cmp);
+
+       return (res != NULL);
+}
+
+/* Verify if the iterator output matches the tids in the array for debugging */
+static void
+ts_debug_iter_check_tids(TidStoreIter *iter)
+{
+       BlockNumber blkno = iter->output.blkno;
+
+       for (int i = 0; i < iter->output.num_offsets; i++)
+       {
+               ItemPointer tid = &(iter->ts->tids[iter->tids_idx + i]);
+
+               Assert((iter->tids_idx + i) < iter->ts->control->max_tids);
+               Assert(ItemPointerGetBlockNumber(tid) == blkno);
+               Assert(ItemPointerGetOffsetNumber(tid) == 
iter->output.offsets[i]);
+       }
+
+       iter->tids_idx += iter->output.num_offsets;
+}
+#endif
-- 
2.31.1

Reply via email to