On Fri, 3 May 2024 at 15:55, David Rowley <dgrowle...@gmail.com> wrote:
>
> (40af10b57 did this for tuplesort.c, this is the same, but for tuplestore.c)
>
> I was looking at the tuplestore.c code a few days ago and noticed that
> it allocates tuples in the memory context that tuplestore_begin_heap()
> is called in, which for nodeMaterial.c, is ExecutorState.
>
> I didn't think this was great because:
> 1. Allocating many chunks in ExecutorState can bloat the context with
> many blocks worth of free'd chunks, stored on freelists that might
> never be reused for anything.
> 2. To clean up the memory, pfree must be meticulously called on each
> allocated tuple.
> 3. ExecutorState is an aset.c context which isn't the most efficient
> allocator for this purpose.

Agreed on all counts.

> I've attached 2 patches:
>
> 0001:  Adds memory tracking to Materialize nodes, which looks like:
>
>          ->  Materialize (actual time=0.033..9.157 rows=10000 loops=2)
>                Storage: Memory  Maximum Storage: 10441kB
>
> 0002: Creates a Generation MemoryContext for storing tuples in tuplestore.
>
> Using generation has the following advantages:

[...]
> 6. Generation has a page-level freelist, so is able to reuse pages
> instead of freeing and mallocing another if tuplestore_trim() is used
> to continually remove no longer needed tuples. aset.c can only
> efficiently do this if the tuples are all in the same size class.

Was a bump context considered? If so, why didn't it make the cut?
If tuplestore_trim is the only reason why the type of context in patch
2 is a generation context, then couldn't we make the type of context
conditional on state->eflags & EXEC_FLAG_REWIND, and use a bump
context if we require rewind capabilities (i.e. where _trim is never
effectively executed)?

> master @ 8f0a97dff
> Storage: Memory  Maximum Storage: 16577kB
>
> patched:
> Storage: Memory  Maximum Storage: 8577kB

Those are some impressive numbers.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)


On Fri, 3 May 2024 at 15:55, David Rowley <dgrowle...@gmail.com> wrote:
>
> (40af10b57 did this for tuplesort.c, this is the same, but for tuplestore.c)
>
> I was looking at the tuplestore.c code a few days ago and noticed that
> it allocates tuples in the memory context that tuplestore_begin_heap()
> is called in, which for nodeMaterial.c, is ExecutorState.
>
> I didn't think this was great because:
> 1. Allocating many chunks in ExecutorState can bloat the context with
> many blocks worth of free'd chunks, stored on freelists that might
> never be reused for anything.
> 2. To clean up the memory, pfree must be meticulously called on each
> allocated tuple.
> 3. ExecutorState is an aset.c context which isn't the most efficient
> allocator for this purpose.
>
> I've attached 2 patches:
>
> 0001:  Adds memory tracking to Materialize nodes, which looks like:
>
>          ->  Materialize (actual time=0.033..9.157 rows=10000 loops=2)
>                Storage: Memory  Maximum Storage: 10441kB
>
> 0002: Creates a Generation MemoryContext for storing tuples in tuplestore.
>
> Using generation has the following advantages:
>
> 1. It does not round allocations up to the next power of 2.  Using
> generation will save an average of 25% memory for tuplestores or allow
> an average of 25% more tuples before going to disk.
> 2. Allocation patterns in tuplestore.c are FIFO, which is exactly what
> generation was designed to handle best.
> 3. Generation is faster to palloc/pfree than aset. (See [1]. Compare
> the 4-bit times between aset_palloc_pfree.png and
> generation_palloc_pfree.png)
> 4. tuplestore_clear() and tuplestore_end() can reset or delete the
> tuple context instead of pfreeing every tuple one by one.
> 5. Higher likelihood of neighbouring tuples being stored consecutively
> in memory, resulting in better CPU memory prefetching.
> 6. Generation has a page-level freelist, so is able to reuse pages
> instead of freeing and mallocing another if tuplestore_trim() is used
> to continually remove no longer needed tuples. aset.c can only
> efficiently do this if the tuples are all in the same size class.
>
> The attached bench.sh.txt tests the performance of this change and
> result_chart.png shows the results I got when running on an AMD 3990x
> master @ 8f0a97dff vs patched.
> The script runs benchmarks for various tuple counts stored in the
> tuplestore -- 1 to 8192 in power-2 increments.
>
> The script does output the memory consumed by the tuplestore for each
> query.  Here are the results for the 8192 tuple test:
>
> master @ 8f0a97dff
> Storage: Memory  Maximum Storage: 16577kB
>
> patched:
> Storage: Memory  Maximum Storage: 8577kB
>
> That's roughly half, but note that I padded the tuple to just over
> 1024 bytes, so the alloc set allocations would have been rounded up
> to 2048 bytes.
>
> Some things I've *not* done:
>
> 1. Gone over other executor nodes which use tuplestore to add the same
> additional EXPLAIN output.  CTE Scan, Recursive Union, Window Agg
> could get similar treatment.
> 2. Given much consideration for the block sizes to use for
> GenerationContextCreate(). (Maybe using ALLOCSET_SMALL_INITSIZE for
> the start size is a good idea.)
> 3. A great deal of testing.
>
> I'll park this here until we branch for v18.
>
> David
>
> [1] https://www.postgresql.org/message-id/CAApHDvqUWhOMkUjYXzq95idAwpiPdJLCxxRbf8kV6PYcW5y=c...@mail.gmail.com
