On Fri, Nov 25, 2022 at 5:00 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > So it seems that there are two candidates for the rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-sized nodes and do
> > not use pointer tagging (fanout is part of only these two nodes).
> > rt_node can be 5 bytes in both cases. But before going to this step, I
> > started to verify the idea of variable-size nodes by using a 6-byte
> > rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have
> some comments about some internal APIs below.)
>
> Let's leave the pointer tagging piece out until the main functionality is
> committed. We have all the prerequisites in place, except for a benchmark
> random enough to demonstrate benefit. I'm still not quite satisfied with how
> the shared memory coding looked, and that is the only sticky problem we still
> have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given
> node is optional, and we can refine as needed.
>
> > Overall, the idea of variable-sized nodes is good: smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> +                                                  rt_node_kind_info[kind].inner_size);
> + else
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> +                                                  rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can
> use the non-zeroing versions. When compiling MemoryContextAllocZero, the
> compiler has no idea how big the size is, so it assumes the worst and
> optimizes for large sizes. On x86-64, that means using "rep stos", which
> calls microcode found in the CPU's ROM. This is slow for small sizes. The
> "init" function should be always-inline with const parameters where possible.
> That way, memset can compile to a single instruction for the smallest node
> kind. (More on alloc/init below)

Right. I forgot to update it.
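To check my understanding of the init side, I'm thinking of something like
the following rough sketch (untested, and the "rt_size_class" type name and
the "kind"/"fanout" member paths are only my guesses; the actual fields live
under base.n in the current patch):

static pg_attribute_always_inline void
rt_init_node(rt_node *node, const uint8 kind, const rt_size_class size_class,
             bool inner)
{
    /*
     * With always-inline and a constant size_class at each call site, the
     * memset length folds to a compile-time constant (rt_size_class_info
     * being const), so the compiler can emit a few plain stores for the
     * small node kinds instead of "rep stos".
     */
    if (inner)
        memset(node, 0, rt_size_class_info[size_class].inner_size);
    else
        memset(node, 0, rt_size_class_info[size_class].leaf_size);

    node->kind = kind;
    node->fanout = rt_size_class_info[size_class].fanout;
}

And the allocation would use MemoryContextAlloc() instead of
MemoryContextAllocZero().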
> Note, there is a wrinkle: As currently written, inner_node128 searches the
> child pointers for NULL when inserting, so when expanding from partial to
> full size class, the new node must be zeroed. (Worth fixing in the short
> term. I thought of this while writing the proof-of-concept for size classes,
> but didn't mention it.) Medium term, rather than special-casing this, I
> actually want to rewrite the inner-node128 to be more similar to the leaf,
> with an "isset" array, but accessed and tested differently. I guarantee it's
> *really* slow now to load (maybe somewhat true even for leaves), but I'll
> leave the details for later.

Agreed. I'll start by zeroing out the node when expanding from partial to
full size.

> Regarding node128 leaf, note that it's slightly larger than a DSA size class,
> and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 + 61*8  = 768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.

> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's
> 0002/4/5 can be squashed. Plus, some additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
> +                   bool inner)
> +{
> +    rt_node    *newnode;
> +
> +    newnode = rt_alloc_node(tree, kind, inner);
> +    rt_init_node(newnode, kind, shift, chunk, inner);
> +
> +    return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> +    rt_node    *newnode;
> +
> +    newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> +                                 node->shift > 0);
> +    newnode->count = node->count;
> +
> +    return newnode;
> +}
>
> This, in turn, just calls a function that does _almost_ everything, and
> additionally must set one member. This function should really be alloc-node +
> init-node + copy-common, where copy-common is like in the prototype:
>
> + newnode->node_shift = oldnode->node_shift;
> + newnode->node_chunk = oldnode->node_chunk;
> + newnode->count = oldnode->count;
>
> And init-node should really be just memset + set kind + set initial fanout.
> It has no business touching "shift" and "chunk". The callers rt_new_root,
> rt_set_extend, and rt_extend set some values of their own anyway, so let them
> set those, too -- it might even improve readability.

Agreed.

> - if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
> + if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
>
> This macro doesn't really improve readability -- it obscures what is being
> tested, and the name implies the "else" branch means "node doesn't need to
> grow class", which is false. If we want to simplify expressions in this
> block, I think it'd be more effective to improve the lines that follow:
>
> + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
> + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
>
> Maybe we can have const variables old_size and new_fanout to break out the
> array lookup?
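Yes, I think that would read better. Something like the following, I suppose
(just a sketch to confirm the idea; I'm guessing Size and uint8 as the
variable types):

    const Size  old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
    const uint8 new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

    /* copy the partial node into the full one, then bump the fanout */
    memcpy(new32, n32, old_size);
    new32->base.n.fanout = new_fanout;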
> While I'm thinking of it, these arrays should be const so the compiler can
> avoid runtime lookups. Speaking of...
>
> +/* Copy both chunks and children/values arrays */
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +                          uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> +    /* For better code generation */
> +    if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
> +        pg_unreachable();
> +
> +    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> +    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> When I looked at this earlier, I somehow didn't go far enough -- why are we
> passing the runtime count in the first place? This function can only be
> called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
> parameter to memcpy should evaluate to a compile-time constant, right? Even
> when we add node shrinking in the future, the constant should be correct,
> IIUC?

Right. We don't need to pass count to these functions.

> - .fanout = 256,
> + /* technically it's 256, but we can't store that in a uint8,
> +    and this is the max size class so it will never grow */
> + .fanout = 0,
>
> - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
> + Assert(((rt_node *) n256)->fanout == 0);
> + Assert(chunk_exists || ((rt_node *) n256)->count < 256);
>
> These hacks were my work, but I think we can improve that by having two
> versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
> nodes. For that to work, in "init-node" we'd need a branch to set fanout to
> zero for node256. That should be fine -- it already has to branch for
> memset'ing node128's indexes to 0xFF.

Since every node has a fanout regardless of whether it is fixed- or
variable-sized, node256 is the only special case, where the fanout stored in
the node doesn't match its actual fanout. So if we want two versions of
NODE_HAS_FREE_SLOT, I think we can have one for node256 and one for all other
classes. Thoughts?

In your idea, did you mean something like the following for the
fixed-sized-node version of NODE_HAS_FREE_SLOT?

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
    ((node)->base.n.count < rt_size_class_info[class].fanout)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com