On Wed, Sep 28, 2022 at 12:49 PM Masahiko Sawada <sawada.m...@gmail.com> wrote:
>
> On Fri, Sep 23, 2022 at 12:11 AM John Naylor
> <john.nay...@enterprisedb.com> wrote:
> >
> > On Thu, Sep 22, 2022 at 11:46 AM John Naylor
> > <john.nay...@enterprisedb.com> wrote:
> > > One thing I want to try soon is storing fewer than 16/32 etc. entries, so
> > > that the whole node fits comfortably inside a power-of-two allocation.
> > > That would allow us to use aset without wasting space for the smaller
> > > nodes, which would be faster and possibly would solve the fragmentation
> > > problem Andres referred to in
> > > https://www.postgresql.org/message-id/20220704220038.at2ane5xkymzzssb%40awork3.anarazel.de
> >
> > While calculating node sizes that fit within a power-of-two size, I noticed
> > the current base node is a bit wasteful, taking up 8 bytes. The node kind
> > only has a small number of values, so it doesn't really make sense to use
> > an enum here in the struct (in fact, Andres' prototype used a uint8 for
> > node_kind). We could use a bitfield for the count and kind:
> >
> > uint16 -- kind and count bitfield
> > uint8 shift;
> > uint8 chunk;
> >
> > That's only 4 bytes. Plus, if the kind is ever encoded in a pointer tag,
> > the bitfield can just go back to being count only.
>
> Good point, agreed.
>
> > Here are the v6 node kinds:
> >
> > node4:   8 +   4 +(4)          +   4*8 =   48 bytes
> > node16:  8 +  16               +  16*8 =  152
> > node32:  8 +  32               +  32*8 =  296
> > node128: 8 + 256 + 128/8       + 128*8 = 1304
> > node256: 8       + 256/8       + 256*8 = 2088
> >
> > And here are the possible ways we could optimize nodes for space using aset
> > allocation. Parentheses are padding bytes.
> > Even if my math has mistakes,
> > the numbers shouldn't be too far off:
> >
> > node3:   4 +   3 +(1)          +   3*8 =   32 bytes
> > node6:   4 +   6 +(6)          +   6*8 =   64
> > node13:  4 +  13 +(7)          +  13*8 =  128
> > node28:  4 +  28               +  28*8 =  256
> > node31:  4 + 256 +  32/8       +  31*8 =  512 (XXX not good)
> > node94:  4 + 256 +  96/8       +  94*8 = 1024
> > node220: 4 + 256 + 224/8       + 220*8 = 2048
> > node256:                               = 4096
> >
> > The main disadvantage is that node256 would balloon in size.
>
> Yeah, node31 and node256 are bloated. We probably could use slab for
> node256 independently. It's worth trying a benchmark to see how it
> affects the performance and the tree size.
>
> BTW we need to consider not only aset/slab but also DSA, since we
> allocate dead tuple TIDs on DSM in parallel vacuum cases. FYI DSA uses
> the following size classes:
>
> static const uint16 dsa_size_classes[] = {
>     sizeof(dsa_area_span), 0,       /* special size classes */
>     8, 16, 24, 32, 40, 48, 56, 64,  /* 8 classes separated by 8 bytes */
>     80, 96, 112, 128,               /* 4 classes separated by 16 bytes */
>     160, 192, 224, 256,             /* 4 classes separated by 32 bytes */
>     320, 384, 448, 512,             /* 4 classes separated by 64 bytes */
>     640, 768, 896, 1024,            /* 4 classes separated by 128 bytes */
>     1280, 1560, 1816, 2048,         /* 4 classes separated by ~256 bytes */
>     2616, 3120, 3640, 4096,         /* 4 classes separated by ~512 bytes */
>     5456, 6552, 7280, 8192          /* 4 classes separated by ~1024 bytes */
> };
>
> node256 will be classed as 2616, which is still not good.
>
> Anyway, I'll implement DSA support for radix tree.
Regarding DSA support, IIUC we need to use dsa_pointer in inner nodes
to point to their child nodes, instead of C pointers (i.e.,
backend-local addresses). I'm thinking of a straightforward approach
as the first step: inner nodes have a union of rt_node* and
dsa_pointer, and we choose either one based on whether the radix tree
is shared or not. We allocate and free the shared memory for
individual nodes with dsa_allocate() and dsa_free(), respectively, so
we need to get a C pointer from a dsa_pointer using dsa_get_address()
while descending the tree. I'm a bit concerned that calling
dsa_get_address() for every descent could be a performance overhead,
but I'm going to measure it anyway.

Regards,

--
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com