Re: [PoC] Improve dead tuple storage for lazy vacuum

Masahiko Sawada Mon, 12 Jun 2023 22:47:55 -0700

On Tue, Jun 6, 2023 at 2:13 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Mon, Jun 5, 2023 at 5:32 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > > Sometime in the not-too-distant future, I will start a new thread 
> > > focusing on bitmap heap scan, but for now, I just want to share some 
> > > progress on making the radix tree usable not only for that, but hopefully 
> > > a wider range of applications, while making the code simpler and the 
> > > binary smaller. The attached patches are incomplete (e.g. no iteration) 
> > > and quite a bit messy, so tar'd and gzip'd for the curious (should apply 
> > > on top of v32 0001-03 + 0007-09 ).
> > >
> >
> > Thank you for making progress on this. I agree with these directions
> > overall. I have some comments and questions:
>
> Glad to hear it and thanks for looking!
>
> > > * Other nodes will follow in due time, but only after I figure out how to 
> > > do it nicely (ideas welcome!) -- currently node32's two size classes work 
> > > fine for growing, but the code should be simplified before extending to 
> > > other cases.)
> >
> > Within the size class, we just alloc a new node of lower size class
> > and do memcpy(). I guess it will be almost same as what we do for
> > growing.
>
> Oh, the memcpy part is great, very simple. I mean the (compile-time) "class 
> info" table lookups are a bit awkward. I'm thinking the hard-coded numbers 
> like this:
>
> .fanout = 3,
> .inner_size = sizeof(RT_NODE_INNER_3) + 3 * sizeof(RT_PTR_ALLOC),
>
> ...may be better with a #defined symbol that can also be used elsewhere.


FWIW, exposing these definitions would be good in terms of testing too
since we can use them in regression tests.

>
> > I don't think
> > shrinking class-3 to class-1 makes sense.
>
> Agreed. The smallest kind should just be freed when empty.
>
> > > Limited support for "combined pointer-value slots". At compile-time, 
> > > choose either that or "single-value leaves" based on the size of the 
> > > value type template parameter. Values that are pointer-sized or less can 
> > > fit in the last-level child slots of nominal "inner nodes" without 
> > > duplicated leaf-node code. Node256 now must act like the previous 
> > > 'node256 leaf', since zero is a valid value. Aside from that, this was a 
> > > small change.
> >
> > Yes, but it also means that we use pointer-sized value anyway even if
> > the value size is less than that, which wastes the memory, no?
>
> At a low-level, that makes sense, but I've found an interesting global effect 
> showing the opposite: _less_ memory, which may compensate:
>
> psql -c "select * from bench_search_random_nodes(1*1000*1000)"
> num_keys = 992660
>
> (using a low enough number that the experimental change n125->n63 doesn't 
> affect anything)
> height = 4, n3 = 375258, n15 = 137490, n32 = 0, n63 = 0, n256 = 1025
>
> v31:
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       47800768 |     253 |       134
>
> (unreleased code "similar" to v33, but among other things restores the 
> separate "extend down" function)
>  mem_allocated | load_ms | search_ms
> ---------------+---------+-----------
>       42926048 |     221 |       127
>
> I'd need to make sure, but apparently just going from 6 non-empty memory 
> contexts to 3 (remember all values are embedded here) reduces memory 
> fragmentation significantly in this test. (That should also serve as a 
> demonstration that additional size classes have both runtime costs as well as 
> benefits. We need to have a balance.)

Interesting. The result would probably vary if we change the slab
block sizes. I'd like to experiment if the code is available.

>
> So, I'm inclined to think the only reason to prefer "multi-value leaves" is 
> if 1) the value type is _bigger_ than a pointer 2) there is no convenient 
> abbreviation (like tid bitmaps have) and 3) the use case really needs to 
> avoid another memory access. Under those circumstances, though, the new code 
> plus lazy expansion etc might suit and be easier to maintain.

Indeed.

>
> > > What I've shared here could work (in principal, since it uses uint64 
> > > values) for tidstore, possibly faster (untested) because of better code 
> > > density, but as mentioned I want to shoot for higher. For tidbitmap.c, I 
> > > want to extend this idea and branch at run-time on a per-value basis, so 
> > > that a page-table entry that fits in a pointer can go there, and if not, 
> > > it'll be a full leaf. (This technique enables more flexibility in 
> > > lossifying pages as well.) Run-time info will require e.g. an additional 
> > > bit per slot. Since the node header is now 3 bytes, we can spare one more 
> > > byte in the node3 case. In addition, we can and should also bump it back 
> > > up to node4, still keeping the metadata within 8 bytes (no struct 
> > > padding).
> >
> > Sounds good.
>
> The additional bit per slot would require per-node logic and additional 
> branches, which is not great. I'm now thinking a much easier way to get there 
> is to give up (at least for now) on promising that "run-time embeddable 
> values" can use the full pointer-size (unlike value types found embeddable at 
> compile-time). Reserving the lowest pointer bit for a tag "value or 
> pointer-to-leaf" would have a much smaller code footprint.

Do you mean we can make sure that the value doesn't set the lowest
bit? Or is it an optimization for TIDStore?

> In addition, without a new bitmap, the smallest node can actually be up to a 
> node5 with no struct padding, with a node2 as a subclass. (Those numbers 
> coincidentally were also one scenario in the paper, when calculating 
> worst-case memory usage). That's worth considering.

Agreed.

FWIW please let me know if there are some experiments I can help with.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Re: [PoC] Improve dead tuple storage for lazy vacuum

Reply via email to