On Fri, Nov 25, 2022 at 5:00 PM John Naylor <john.nay...@enterprisedb.com> wrote:
>
> On Thu, Nov 24, 2022 at 9:54 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
> >
> > So it seems that there are two candidates for the rt_node structure: (1)
> > all nodes except for node256 are variable-size nodes and use pointer
> > tagging, and (2) node32 and node128 are variable-sized nodes and do
> > not use pointer tagging (fanout is part of only these two nodes).
> > rt_node can be 5 bytes in both cases. But before going to this step, I
> > started to verify the idea of variable-size nodes by using a 6-byte
> > rt_node. We can adjust the node kinds and node classes later.
>
> First, I'm glad you picked up the size class concept and expanded it. (I have
> some comments about some internal APIs below.)
>
> Let's leave the pointer tagging piece out until the main functionality is
> committed. We have all the prerequisites in place, except for a benchmark
> random enough to demonstrate benefit. I'm still not quite satisfied with how
> the shared memory coding looked, and that is the only sticky problem we still
> have, IMO. The rest is "just work".
>
> That said, (1) and (2) above are still relevant -- variable sizing any given
> node is optional, and we can refine as needed.
>
> > Overall, the idea of variable-sized nodes is good: smaller size
> > without losing search performance.
>
> Good.
>
> > I'm going to check the load
> > performance as well.
>
> Part of that is this, which gets called a lot more now, when node1 expands:
>
> + if (inner)
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->inner_slabs[kind],
> +                                                  rt_node_kind_info[kind].inner_size);
> + else
> +     newnode = (rt_node *) MemoryContextAllocZero(tree->leaf_slabs[kind],
> +                                                  rt_node_kind_info[kind].leaf_size);
>
> Since memset for expanding size class is now handled separately, these can
> use the non-zeroing versions. When compiling MemoryContextAllocZero, the
> compiler has no idea how big the size is, so it assumes the worst and
> optimizes for large sizes. On x86-64, that means using "rep stos", which
> calls microcode found in the CPU's ROM. This is slow for small sizes. The
> "init" function should be always-inline with const parameters where possible.
> That way, memset can compile to a single instruction for the smallest node
> kind. (More on alloc/init below)

Right. I forgot to update it.
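To check my understanding of the init side, I'm thinking of something like
the following rough sketch (untested, and the "rt_size_class" type name and
the "kind"/"fanout" member paths are only my guesses; the actual fields live
under base.n in the current patch):

static pg_attribute_always_inline void
rt_init_node(rt_node *node, const uint8 kind, const rt_size_class size_class,
             bool inner)
{
    /*
     * With always-inline and a constant size_class at each call site, the
     * memset length folds to a compile-time constant (rt_size_class_info
     * being const), so the compiler can emit a few plain stores for the
     * small node kinds instead of "rep stos".
     */
    if (inner)
        memset(node, 0, rt_size_class_info[size_class].inner_size);
    else
        memset(node, 0, rt_size_class_info[size_class].leaf_size);

    node->kind = kind;
    node->fanout = rt_size_class_info[size_class].fanout;
}

And the allocation would use MemoryContextAlloc() instead of
MemoryContextAllocZero().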
> Note, there is a wrinkle: As currently written, inner_node128 searches the
> child pointers for NULL when inserting, so when expanding from partial to
> full size class, the new node must be zeroed. (Worth fixing in the short
> term. I thought of this while writing the proof-of-concept for size classes,
> but didn't mention it.) Medium term, rather than special-casing this, I
> actually want to rewrite the inner-node128 to be more similar to the leaf,
> with an "isset" array, but accessed and tested differently. I guarantee it's
> *really* slow now to load (maybe somewhat true even for leaves), but I'll
> leave the details for later.

Agreed. I'll start by zeroing out the node when expanding from partial to
full size.

> Regarding node128 leaf, note that it's slightly larger than a DSA size class,
> and we can trim it to fit:
>
> node61:  6 + 256+(2) +16 + 61*8  = 768
> node125: 6 + 256+(2) +16 + 125*8 = 1280

Agreed, changed.

> > I've attached the patches I used for the verification. I don't include
> > patches for pointer tagging, DSA support, and vacuum integration since
> > I'm investigating the issue on cfbot that Andres reported. Also, I've
> > modified tests to improve the test coverage.
>
> Sounds good. For v12, I think size classes have proven themselves, so v11's
> 0002/4/5 can be squashed. Plus, some additional comments:
>
> +/* Return a new and initialized node */
> +static rt_node *
> +rt_alloc_init_node(radix_tree *tree, uint8 kind, uint8 shift, uint8 chunk,
> +                   bool inner)
> +{
> +    rt_node    *newnode;
> +
> +    newnode = rt_alloc_node(tree, kind, inner);
> +    rt_init_node(newnode, kind, shift, chunk, inner);
> +
> +    return newnode;
> +}
>
> I don't see the point of a function that just calls two functions.

Removed.

> +/*
> + * Create a new node with 'new_kind' and the same shift, chunk, and
> + * count of 'node'.
> + */
> +static rt_node *
> +rt_grow_node(radix_tree *tree, rt_node *node, int new_kind)
> +{
> +    rt_node    *newnode;
> +
> +    newnode = rt_alloc_init_node(tree, new_kind, node->shift, node->chunk,
> +                                 node->shift > 0);
> +    newnode->count = node->count;
> +
> +    return newnode;
> +}
>
> This, in turn, just calls a function that does _almost_ everything, and
> additionally must set one member. This function should really be alloc-node +
> init-node + copy-common, where copy-common is like in the prototype:
>
> + newnode->node_shift = oldnode->node_shift;
> + newnode->node_chunk = oldnode->node_chunk;
> + newnode->count = oldnode->count;
>
> And init-node should really be just memset + set kind + set initial fanout.
> It has no business touching "shift" and "chunk". The callers rt_new_root,
> rt_set_extend, and rt_extend set some values of their own anyway, so let them
> set those, too -- it might even improve readability.

Agreed.

> - if (n32->base.n.fanout == rt_size_class_info[RT_CLASS_32_PARTIAL].fanout)
> + if (NODE_NEEDS_TO_GROW_CLASS(n32, RT_CLASS_32_PARTIAL))
>
> This macro doesn't really improve readability -- it obscures what is being
> tested, and the name implies the "else" branch means "node doesn't need to
> grow class", which is false. If we want to simplify expressions in this
> block, I think it'd be more effective to improve the lines that follow:
>
> + memcpy(new32, n32, rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size);
> + new32->base.n.fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;
>
> Maybe we can have const variables old_size and new_fanout to break out the
> array lookup?
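Yes, I think that would read better. Something like the following, I suppose
(just a sketch to confirm the idea; I'm guessing Size and uint8 as the
variable types):

    const Size  old_size = rt_size_class_info[RT_CLASS_32_PARTIAL].inner_size;
    const uint8 new_fanout = rt_size_class_info[RT_CLASS_32_FULL].fanout;

    /* copy the partial node into the full one, then bump the fanout */
    memcpy(new32, n32, old_size);
    new32->base.n.fanout = new_fanout;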
> While I'm thinking of it, these arrays should be const so the compiler can
> avoid runtime lookups. Speaking of...
>
> +/* Copy both chunks and children/values arrays */
> +static inline void
> +chunk_children_array_copy(uint8 *src_chunks, rt_node **src_children,
> +                          uint8 *dst_chunks, rt_node **dst_children, int count)
> +{
> +    /* For better code generation */
> +    if (count > rt_node_kind_info[RT_NODE_KIND_4].fanout)
> +        pg_unreachable();
> +
> +    memcpy(dst_chunks, src_chunks, sizeof(uint8) * count);
> +    memcpy(dst_children, src_children, sizeof(rt_node *) * count);
> +}
>
> When I looked at this earlier, I somehow didn't go far enough -- why are we
> passing the runtime count in the first place? This function can only be
> called if count == rt_size_class_info[RT_CLASS_4_FULL].fanout. The last
> parameter to memcpy should evaluate to a compile-time constant, right? Even
> when we add node shrinking in the future, the constant should be correct,
> IIUC?

Right. We don't need to pass count to these functions.

> - .fanout = 256,
> + /* technically it's 256, but we can't store that in a uint8,
> +    and this is the max size class so it will never grow */
> + .fanout = 0,
>
> - Assert(chunk_exists || NODE_HAS_FREE_SLOT(n256));
> + Assert(((rt_node *) n256)->fanout == 0);
> + Assert(chunk_exists || ((rt_node *) n256)->count < 256);
>
> These hacks were my work, but I think we can improve that by having two
> versions of NODE_HAS_FREE_SLOT -- one for fixed- and one for variable-sized
> nodes. For that to work, in "init-node" we'd need a branch to set fanout to
> zero for node256. That should be fine -- it already has to branch for
> memset'ing node128's indexes to 0xFF.

Since every node has a fanout regardless of whether it is fixed- or
variable-sized, node256 is the only special case, where the fanout stored in
the node doesn't match its actual fanout. So if we want two versions of
NODE_HAS_FREE_SLOT, I think we can have one for node256 and one for all other
classes. Thoughts?

In your idea, did you mean something like the following for the
fixed-sized-node version of NODE_HAS_FREE_SLOT?

#define FIXED_NODE_HAS_FREE_SLOT(node, class) \
    ((node)->base.n.count < rt_size_class_info[class].fanout)

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com