Re: slab allocator performance issues

David Rowley Mon, 05 Dec 2022 00:02:40 -0800

On Fri, 11 Nov 2022 at 22:20, John Naylor <john.nay...@enterprisedb.com> wrote:
>
>
> On Wed, Oct 12, 2022 at 4:37 PM David Rowley <dgrowle...@gmail.com> wrote:
> > [v3]
>
> + /*
> + * Compute a shift that guarantees that shifting chunksPerBlock with it
> + * yields is smaller than SLAB_FREELIST_COUNT - 1 (one freelist is used for 
> full blocks).
> + */
> + slab->freelist_shift = 0;
> + while ((slab->chunksPerBlock >> slab->freelist_shift) >= 
> (SLAB_FREELIST_COUNT - 1))
> + slab->freelist_shift++;
>
> + /*
> + * Ensure, without a branch, that index 0 is only used for blocks entirely
> + * without free chunks.
> + * XXX: There probably is a cheaper way to do this. Needing to shift twice
> + * by slab->freelist_shift isn't great.
> + */
> + index = (freecount + (1 << slab->freelist_shift) - 1) >> 
> slab->freelist_shift;
>
> How about something like
>
> #define SLAB_FREELIST_COUNT ((1<<3) + 1)
> index = (freecount & (SLAB_FREELIST_COUNT - 2)) + (freecount != 0);


Doesn't this create a sort of round-robin use of the free list?  What
we want is a sort of "histogram" bucket set of free lists so we can
group together blocks that have a close-enough free number of chunks.
Unless I'm mistaken, I think what you have doesn't do that.

I wondered if simply:

index = -(-freecount >> slab->freelist_shift);

would be faster than Andres' version.  I tried it out and on my AMD
machine, it's about the same speed. Same on a Raspberry Pi 4.

Going by [2], the instructions are very different with each method, so
other machines with different latencies on those instructions might
show something different. I attached what I used to test if anyone
else wants a go.

AMD Zen2
$ ./freecount 2000000000
Test 'a' in 0.922766 seconds
Test 'd' in 0.922762 seconds (0.000433% faster)

RPI4
$ ./freecount 2000000000
Test 'a' in 3.341350 seconds
Test 'd' in 3.338690 seconds (0.079672% faster)

That was gcc. Trying it with clang, it went in a little heavy-handed
and optimized out my loop, so some more trickery might be needed for a
useful test on that compiler.

David

[2] https://godbolt.org/z/dh95TohEG

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define SLAB_FREELIST_COUNT 9

int main(int argc, char **argv)
{
        clock_t start, end;
        double v1_time, v2_time;
        int freecount;
        int freelist_shift = 0;
        int chunksPerBlock;
        int index;
        int zerocount = 0;
        
        if (argc < 2)
        {
                printf("Syntax: %s <chunksPerBlock>\n", argv[0]);
                return -1;
        }
        
        chunksPerBlock = atoi(argv[1]);
        printf("chunksPerBlock = %d\n", chunksPerBlock);
        
        while ((chunksPerBlock >> freelist_shift) >= (SLAB_FREELIST_COUNT - 1))
                freelist_shift++;
        printf("freelist_shift = %d\n", freelist_shift);
        
        start = clock();
        
        for (freecount = 0; freecount <= chunksPerBlock; freecount++)
        {
                index = (freecount + (1 << freelist_shift) - 1) >> 
freelist_shift;
                
                /* try to prevent optimizing the above out */
                if (index == 0)
                        zerocount++;
        }
        
        
        end = clock();
        v1_time = (double) (end - start) / CLOCKS_PER_SEC;
        printf("Test 'a' in %f seconds\n", v1_time);
        printf("zerocount = %d\n", zerocount);
        
        zerocount = 0;
        start = clock();

        for (freecount = 0; freecount <= chunksPerBlock; freecount++)
        {
                index = -(-freecount >> freelist_shift);

                /* try to prevent optimizing the above out */
                if (index == 0)
                        zerocount++;
                
        }

        end = clock();
        v2_time = (double) (end - start) / CLOCKS_PER_SEC;
        printf("Test 'd' in %f seconds (%f%% faster)\n", v2_time, v1_time / 
v2_time * 100.0 - 100.0);
        printf("zerocount = %d\n", zerocount);
        return 0;
}

Re: slab allocator performance issues

Reply via email to