I didn't get any closer to the radix-tree regression, but I did find some inefficiencies in tidstore_add_tids() that are worth talking about first. They're addressed in rough fashion in the attached .txt addendums, which I can clean up and incorporate later.
To start, I can reproduce the regression with this test as well:

select * from bench_tidstore_load(0, 10 * 1000 * 1000);

v15 + v26 store + adjustments:
 mem_allocated | load_ms
---------------+---------
      98202152 |    1676

v26 0001-0008:
 mem_allocated | load_ms
---------------+---------
      98202032 |    1826

...and reverting to the alternate way to update the parent didn't help:

v26 0001-6, 0008, insert_inner w/ null parent:
 mem_allocated | load_ms
---------------+---------
      98202032 |    1825

...and I'm kind of glad that wasn't the problem, because going back to that would be a pain for the shmem case.

Running perf doesn't show anything much different in the proportions (note that rt_set must have been inlined when declared locally in v26):

v15 + v26 store + adjustments:
  65.88%  postgres  postgres  [.] tidstore_add_tids
  10.74%  postgres  postgres  [.] rt_set
   9.20%  postgres  postgres  [.] palloc0
   6.49%  postgres  postgres  [.] rt_node_insert_leaf

v26 0001-0008:
  78.50%  postgres  postgres  [.] tidstore_add_tids
   8.88%  postgres  postgres  [.] palloc0
   6.24%  postgres  postgres  [.] local_rt_node_insert_leaf

v2699-0001:

The first thing I noticed is that palloc0 takes way more time than it should, because the compiler doesn't know the values[] array is small. One reason we need to zero the array is to make the algorithm agnostic about what order the offsets come in, as I requested in a previous review. Thinking some more, I was way too paranoid about that. As long as access methods scan the line pointer array in the usual way, maybe we can just assert that the keys we create are in order, and zero any unused array entries as we find them. (I admit I can't actually think of a reason we would ever encounter offsets out of order.) Also, we can keep track of the last key we need to consider for insertion into the radix tree, and ignore the rest. That might shave a few cycles while we hold the exclusive lock whenever the max offset of an LP_DEAD item on a given page is < 64, which I think would be common in the wild. I also got rid of the special case for non-encoding, since shifting by zero should work the same way.

These changes together led to a nice speedup on the v26 branch:

 mem_allocated | load_ms
---------------+---------
      98202032 |    1386

v2699-0002:

The next thing I noticed is that we form a full ItemPointer just to pass it to tid_to_key_off(). That's bad for tidstore_add_tids(), because ItemPointerSetBlockNumber() must split the block number so the struct can be SHORTALIGN'd:

static inline void
BlockIdSet(BlockIdData *blockId, BlockNumber blockNumber)
{
    blockId->bi_hi = blockNumber >> 16;
    blockId->bi_lo = blockNumber & 0xffff;
}

Then, tid_to_key_off() calls ItemPointerGetBlockNumber(), which must reverse the above process:

static inline BlockNumber
BlockIdGetBlockNumber(const BlockIdData *blockId)
{
    return (((BlockNumber) blockId->bi_hi) << 16) |
           ((BlockNumber) blockId->bi_lo);
}

There is no reason to do any of this if we're not reading/writing directly to/from an on-disk tid. To avoid it, I created a new function encode_key_off() [name could be better], which deals with the raw block number that we already have, and turned tid_to_key_off() into a wrapper around it, since we still need the full conversion for tidstore_lookup_tid().

v2699-0003:

Get rid of all the remaining special cases for encoding or not. I'm not aware of any need to optimize that case or treat it differently.
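To make the encoding these patches operate on concrete, here is a minimal standalone sketch. The names and constants are illustrative only, and the zero-initialized array stands in for the patch's explicit gap-zeroing: each (block, offset) pair is folded into one integer, the low 6 bits pick a position in the 64-bit value, and the remaining bits become the radix tree key, so offsets scanned in line-pointer order produce nondecreasing keys.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define VALUE_NBITS     6       /* 2^6 = 64 offsets per bitmap word */
#define OFFSET_NBITS    9       /* enough for ~291 offsets on default 8kB pages */
#define NKEYS           (1 << (OFFSET_NBITS - VALUE_NBITS))

/*
 * Fold a block number and offset into one integer, then split it into a
 * radix tree key and a bit position within the 64-bit value at that key.
 * Mirrors the idea behind encode_key_off(), minus the TidStore struct.
 */
static inline uint64_t
toy_encode_key_off(uint32_t block, uint32_t offset, uint32_t *off)
{
    uint64_t    tid_i = offset | ((uint64_t) block << OFFSET_NBITS);

    *off = tid_i & ((UINT64_C(1) << VALUE_NBITS) - 1);
    return tid_i >> VALUE_NBITS;
}

int
main(void)
{
    /* pretend these are the LP_DEAD offsets of block 4321, in order */
    uint32_t    offsets[] = {1, 2, 70, 150};
    uint64_t    values[NKEYS] = {0};
    uint64_t    key_base = (uint64_t) 4321 << (OFFSET_NBITS - VALUE_NBITS);
    uint64_t    prev_key = key_base;
    int         last_idx = 0;

    for (int i = 0; i < (int) (sizeof(offsets) / sizeof(offsets[0])); i++)
    {
        uint32_t    off;
        uint64_t    key = toy_encode_key_off(4321, offsets[i], &off);

        assert(key >= prev_key);    /* relies on line-pointer order */
        last_idx = (int) (key - key_base);
        values[last_idx] |= UINT64_C(1) << off;
        prev_key = key;
    }

    /* only keys up to the last nonzero bitmap would reach the radix tree */
    for (int i = 0; i <= last_idx; i++)
        if (values[i])
            printf("key %llu -> bitmap %016llx\n",
                   (unsigned long long) (key_base + i),
                   (unsigned long long) values[i]);
    return 0;
}

Compiled standalone, this prints one key/bitmap pair per 64-offset chunk of the block, which is what the real code hands to the radix tree.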
I haven't tested this on an installation with non-default blocksize, and didn't measure this separately, but 0002+0003 gives:

 mem_allocated | load_ms
---------------+---------
      98202032 |    1259

If these are acceptable, I can incorporate them into a later patchset. In any case, speeding up tidstore_add_tids() will make any regressions in the backing radix tree more obvious. I will take a look at that next week.

--
John Naylor
EDB: http://www.enterprisedb.com
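As an aside on the non-default blocksize point: block size only enters through offset_nbits (derived from the maximum offset per page), which determines how many keys a heap block spans. A rough sketch of that arithmetic follows; the MaxHeapTuplesPerPage figures for larger block sizes are assumptions from memory and should be double-checked against an actual build.

#include <stdio.h>

/* same semantics as pg_ceil_log2_32() for the values used here */
static int
ceil_log2(unsigned int n)
{
    int         nbits = 0;

    while ((1u << nbits) < n)
        nbits++;
    return nbits;
}

int
main(void)
{
    /* {BLCKSZ, MaxHeapTuplesPerPage} -- the latter assumed, not verified */
    unsigned int pages[][2] = {{8192, 291}, {16384, 584}, {32768, 1169}};

    for (int i = 0; i < 3; i++)
    {
        int         offset_nbits = ceil_log2(pages[i][1]);
        int         nkeys = 1 << (offset_nbits - 6);    /* 6 = bits per uint64 bitmap */

        printf("BLCKSZ %5u: offset_nbits %d, keys per block %d\n",
               pages[i][0], offset_nbits, nkeys);
    }
    return 0;
}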
From 6bdd33fa4f55757b54d16ce00dc60a21b929606e Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 11 Feb 2023 10:45:21 +0700
Subject: [PATCH v2699 2/3] Do less work when encoding key/value

---
 src/backend/access/common/tidstore.c | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 5d24680737..3d384cf645 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -159,6 +159,7 @@ typedef struct TidStoreIter
 
 static void tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val);
 static inline BlockNumber key_get_blkno(TidStore *ts, uint64 key);
+static inline uint64 encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off);
 static inline uint64 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off);
 
 /*
@@ -367,7 +368,6 @@ void
 tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 				  int num_offsets)
 {
-	ItemPointerData tid;
 	uint64	   *values;
 	uint64		key;
 	uint64		prev_key;
@@ -381,16 +381,12 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 	values = palloc(sizeof(uint64) * nkeys);
 	key = prev_key = key_base;
 
-	ItemPointerSetBlockNumber(&tid, blkno);
-
 	for (int i = 0; i < num_offsets; i++)
 	{
 		uint32		off;
 
-		ItemPointerSetOffsetNumber(&tid, offsets[i]);
-
 		/* encode the tid to key and val */
-		key = tid_to_key_off(ts, &tid, &off);
+		key = encode_key_off(ts, blkno, offsets[i], &off);
 
 		/* make sure we scanned the line pointer array in order */
 		Assert(key >= prev_key);
@@ -681,20 +677,29 @@ key_get_blkno(TidStore *ts, uint64 key)
 /* Encode a tid to key and offset */
 static inline uint64
 tid_to_key_off(TidStore *ts, ItemPointer tid, uint32 *off)
+{
+	uint32		offset = ItemPointerGetOffsetNumber(tid);
+	BlockNumber block = ItemPointerGetBlockNumber(tid);
+
+	return encode_key_off(ts, block, offset, off);
+}
+
+/* encode a block and offset to a key and partial offset */
+static inline uint64
+encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
 {
 	uint64		key;
 	uint64		tid_i;
 
 	if (!ts->control->encode_tids)
 	{
-		*off = ItemPointerGetOffsetNumber(tid);
+		*off = offset;
 
 		/* Use the block number as the key */
-		return (int64) ItemPointerGetBlockNumber(tid);
+		return (int64) block;
 	}
 
-	tid_i = ItemPointerGetOffsetNumber(tid);
-	tid_i |= (uint64) ItemPointerGetBlockNumber(tid) << ts->control->offset_nbits;
+	tid_i = offset | ((uint64) block << ts->control->offset_nbits);
 
 	*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
 	key = tid_i >> TIDSTORE_VALUE_NBITS;
-- 
2.39.1
From c0bc497f50318c8e31ccdf0c2a9186ffc736abeb Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Fri, 10 Feb 2023 19:56:01 +0700
Subject: [PATCH v2699 1/3] Miscellaneous optimizations for tidstore_add_tids()

- remove palloc0; it's expensive for lengths not known at compile-time
- optimize for case with only one key per heap block
- make some initializations const and branch-free
- when writing to the radix tree, stop at the last non-zero bitmap
---
 src/backend/access/common/tidstore.c | 56 ++++++++++++++++++----------
 1 file changed, 36 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 4c72673ce9..5d24680737 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -368,51 +368,67 @@ tidstore_add_tids(TidStore *ts, BlockNumber blkno, OffsetNumber *offsets,
 				  int num_offsets)
 {
 	ItemPointerData tid;
-	uint64		key_base;
 	uint64	   *values;
-	int			nkeys;
+	uint64		key;
+	uint64		prev_key;
+	uint64		off_bitmap = 0;
+	int			idx;
+	const uint64 key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
+	const int	nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
 
 	Assert(!TidStoreIsShared(ts) || ts->control->magic == TIDSTORE_MAGIC);
 
-	if (ts->control->encode_tids)
-	{
-		key_base = ((uint64) blkno) << ts->control->offset_key_nbits;
-		nkeys = UINT64CONST(1) << ts->control->offset_key_nbits;
-	}
-	else
-	{
-		key_base = (uint64) blkno;
-		nkeys = 1;
-	}
-	values = palloc0(sizeof(uint64) * nkeys);
+	values = palloc(sizeof(uint64) * nkeys);
+	key = prev_key = key_base;
 
 	ItemPointerSetBlockNumber(&tid, blkno);
+
 	for (int i = 0; i < num_offsets; i++)
 	{
-		uint64		key;
 		uint32		off;
-		int			idx;
 
 		ItemPointerSetOffsetNumber(&tid, offsets[i]);
 
 		/* encode the tid to key and val */
 		key = tid_to_key_off(ts, &tid, &off);
 
-		idx = key - key_base;
-		Assert(idx >= 0 && idx < nkeys);
+		/* make sure we scanned the line pointer array in order */
+		Assert(key >= prev_key);
 
-		values[idx] |= UINT64CONST(1) << off;
+		if (key > prev_key)
+		{
+			idx = prev_key - key_base;
+			Assert(idx >= 0 && idx < nkeys);
+
+			/* write out offset bitmap for this key */
+			values[idx] = off_bitmap;
+
+			/* zero out any gaps up to the current key */
+			for (int empty_idx = idx + 1; empty_idx < key - key_base; empty_idx++)
+				values[empty_idx] = 0;
+
+			/* reset for current key -- the current offset will be handled below */
+			off_bitmap = 0;
+			prev_key = key;
+		}
+
+		off_bitmap |= UINT64CONST(1) << off;
 	}
 
+	/* save the final index for later */
+	idx = key - key_base;
+	/* write out last offset bitmap */
+	values[idx] = off_bitmap;
+
 	if (TidStoreIsShared(ts))
 		LWLockAcquire(&ts->control->lock, LW_EXCLUSIVE);
 
 	/* insert the calculated key-values to the tree */
-	for (int i = 0; i < nkeys; i++)
+	for (int i = 0; i <= idx; i++)
 	{
 		if (values[i])
 		{
-			uint64		key = key_base + i;
+			key = key_base + i;
 
 			if (TidStoreIsShared(ts))
 				shared_rt_set(ts->tree.shared, key, &values[i]);
-- 
2.39.1
From 82c1f639aaa64cc943af3b53294a63d5d8f7a9b9 Mon Sep 17 00:00:00 2001
From: John Naylor <john.nay...@postgresql.org>
Date: Sat, 11 Feb 2023 11:51:32 +0700
Subject: [PATCH v2699 3/3] Force all callers to encode, no matter how small
 the expected offset

---
 src/backend/access/common/tidstore.c | 36 +++++-----------------------
 1 file changed, 6 insertions(+), 30 deletions(-)

diff --git a/src/backend/access/common/tidstore.c b/src/backend/access/common/tidstore.c
index 3d384cf645..ff8e66936e 100644
--- a/src/backend/access/common/tidstore.c
+++ b/src/backend/access/common/tidstore.c
@@ -99,7 +99,6 @@ typedef struct TidStoreControl
 	size_t		max_bytes;		/* the maximum bytes a TidStore can use */
 	int			max_offset;		/* the maximum offset number */
 	int			offset_nbits;	/* the number of bits required for max_offset */
-	bool		encode_tids;	/* do we use tid encoding? */
 	int			offset_key_nbits;	/* the number of bits of a offset number
 									 * used for the key */
 
@@ -224,21 +223,15 @@ tidstore_create(size_t max_bytes, int max_offset, dsa_area *area)
 	ts->control->max_offset = max_offset;
 	ts->control->offset_nbits = pg_ceil_log2_32(max_offset);
 
+	if (ts->control->offset_nbits < TIDSTORE_VALUE_NBITS)
+		ts->control->offset_nbits = TIDSTORE_VALUE_NBITS;
+
 	/*
 	 * We use tid encoding if the number of bits for the offset number doesn't
 	 * fix in a value, uint64.
 	 */
-	if (ts->control->offset_nbits > TIDSTORE_VALUE_NBITS)
-	{
-		ts->control->encode_tids = true;
-		ts->control->offset_key_nbits =
-			ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
-	}
-	else
-	{
-		ts->control->encode_tids = false;
-		ts->control->offset_key_nbits = 0;
-	}
+	ts->control->offset_key_nbits =
+		ts->control->offset_nbits - TIDSTORE_VALUE_NBITS;
 
 	return ts;
 }
@@ -643,12 +636,6 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
 		uint64		tid_i;
 		OffsetNumber off;
 
-		if (i > iter->ts->control->max_offset)
-		{
-			Assert(!iter->ts->control->encode_tids);
-			break;
-		}
-
 		if ((val & (UINT64CONST(1) << i)) == 0)
 			continue;
 
@@ -668,10 +655,7 @@ tidstore_iter_extract_tids(TidStoreIter *iter, uint64 key, uint64 val)
 static inline BlockNumber
 key_get_blkno(TidStore *ts, uint64 key)
 {
-	if (ts->control->encode_tids)
-		return (BlockNumber) (key >> ts->control->offset_key_nbits);
-
-	return (BlockNumber) key;
+	return (BlockNumber) (key >> ts->control->offset_key_nbits);
 }
 /* Encode a tid to key and offset */
 static inline uint64
@@ -691,14 +675,6 @@ encode_key_off(TidStore *ts, BlockNumber block, uint32 offset, uint32 *off)
 	uint64		key;
 	uint64		tid_i;
 
-	if (!ts->control->encode_tids)
-	{
-		*off = offset;
-
-		/* Use the block number as the key */
-		return (int64) block;
-	}
-
 	tid_i = offset | ((uint64) block << ts->control->offset_nbits);
 
 	*off = tid_i & ((UINT64CONST(1) << TIDSTORE_VALUE_NBITS) - 1);
-- 
2.39.1