On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> <john.nay...@enterprisedb.com> wrote:
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor
> > <john.nay...@enterprisedb.com> wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc. entries, so
> > > that the whole node fits comfortably inside a power-of-two allocation.
> > > That would allow us to use aset without wasting space for the smaller
> > > nodes, which would be faster and possibly would solve the fragmentation
> > > problem Andres referred to in
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed
> > the current base node is a bit wasteful, taking up 8 bytes. The node kind
> > only has a small number of values, so it doesn't really make sense to use
> > an enum here in the struct (in fact, Andres' prototype used a uint8 for
> > node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
> > the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)          +   4*8 =   48 bytes
> > node16:  8 +  16               +  16*8 =  152
> > node32:  8 +  32               +  32*8 =  296
> > node128: 8 + 256 + 128/8       + 128*8 = 1304
> > node256: 8       + 256/8       + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset
> > allocation. Parentheses are padding bytes.
> > Even if my math has mistakes,
> > the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)          +   3*8 =   32 bytes
> > node6:   4 +   6 +(6)          +   6*8 =   64
> > node13:  4 +  13 +(7)          +  13*8 =  128
> > node28:  4 +  28               +  28*8 =  256
> > node31:  4 + 256 +  32/8       +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8       +  94*8 = 1024
> > node220: 4 + 256 + 224/8       + 220*8 = 2048
> > node256:                               = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated. We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW we need to consider not only aset/slab but also DSA, since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
>     sizeof(dsa_area_span), 0,       /* special size classes */
>     8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
>     80, 96, 112, 128,               /* 4 classes separated by 16 bytes */
>     160, 192, 224, 256,             /* 4 classes separated by 32 bytes */
>     320, 384, 448, 512,             /* 4 classes separated by 64 bytes */
>     640, 768, 896, 1024,            /* 4 classes separated by 128 bytes */
>     1280, 1560, 1816, 2048,         /* 4 classes separated by ~256 bytes */
>     2616, 3120, 3640, 4096,         /* 4 classes separated by ~512 bytes */
>     5456, 6552, 7280, 8192          /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to their child nodes, instead of C pointers (i.e.,
backend-local addresses). I'm thinking of a straightforward approach
as the first step: inner nodes have a union of rt_node* and
dsa_pointer, and we choose either one based on whether the radix tree
is shared or not. We allocate and free the shared memory for
individual nodes with dsa_allocate() and dsa_free(), respectively, so
we need to get a C pointer from a dsa_pointer using dsa_get_address()
while descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be a performance overhead,
but I'm going to measure it anyway.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com