(Commit 40af10b57 did this for tuplesort.c; this is the same, but for tuplestore.c.)
I was looking at the tuplestore.c code a few days ago and noticed that it allocates tuples in the memory context that tuplestore_begin_heap() is called in, which for nodeMaterial.c is ExecutorState. I didn't think this was great because:

1. Allocating many chunks in ExecutorState can bloat the context with many blocks' worth of freed chunks, stored on freelists that might never be reused for anything.
2. To clean up the memory, pfree must be meticulously called on each allocated tuple.
3. ExecutorState is an aset.c context, which isn't the most efficient allocator for this purpose.

I've attached 2 patches:

0001: Adds memory tracking to Materialize nodes, which looks like:

        ->  Materialize (actual time=0.033..9.157 rows=10000 loops=2)
              Storage: Memory  Maximum Storage: 10441kB

0002: Creates a Generation MemoryContext for storing tuples in tuplestore.

Using generation has the following advantages:

1. It does not round allocations up to the next power of 2. Using generation will save an average of 25% memory for tuplestores, or allow an average of 25% more tuples before spilling to disk.
2. Allocation patterns in tuplestore.c are FIFO, which is exactly what generation was designed to handle best.
3. Generation is faster to palloc/pfree than aset. (See [1]. Compare the 4-bit times between aset_palloc_pfree.png and generation_palloc_pfree.png.)
4. tuplestore_clear() and tuplestore_end() can reset or delete the tuple context instead of pfreeing every tuple one by one.
5. Higher likelihood of neighbouring tuples being stored consecutively in memory, resulting in better CPU memory prefetching.
6. Generation has a page-level freelist, so it is able to reuse pages instead of freeing and mallocing another one when tuplestore_trim() is used to continually remove no-longer-needed tuples. aset.c can only do this efficiently if the tuples are all in the same size class.
The attached bench.sh.txt tests the performance of this change, and result_chart.png shows the results I got when running on an AMD 3990x, master @ 8f0a97dff vs patched. The script runs benchmarks for various tuple counts stored in the tuplestore -- 1 to 8192 in power-of-2 increments. The script also outputs the memory consumed by the tuplestore for each query. Here are the results for the 8192 tuple test:

master @ 8f0a97dff:
  Storage: Memory  Maximum Storage: 16577kB

patched:
  Storage: Memory  Maximum Storage: 8577kB

That's roughly half, but I did pad the tuple to just over 1024 bytes, so the alloc set allocations would have been rounded up to 2048 bytes.

Some things I've *not* done:

1. Gone over other executor nodes which use tuplestore to add the same additional EXPLAIN output. CTE Scan, Recursive Union and Window Agg could get similar treatment.
2. Given much consideration to the block sizes to use for GenerationContextCreate(). (Maybe using ALLOCSET_SMALL_INITSIZE for the start size is a good idea.)
3. A great deal of testing.

I'll park this here until we branch for v18.

David

[1] https://www.postgresql.org/message-id/CAApHDvqUWhOMkUjYXzq95idAwpiPdJLCxxRbf8kV6PYcW5y=c...@mail.gmail.com
#!/bin/bash

db=postgres

psql -c "drop table if exists t1;" $db
psql -c "create table t1 (a int, b text);" $db
psql -c "insert into t1 select x,repeat('a',1024) from generate_series(1,10000)x;" $db
psql -c "create index on t1(a);" $db
psql -c "vacuum freeze analyze t1;" $db
psql -c "select pg_prewarm('t1');" $db

for r in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
do
	echo -n "$r rows "
	psql -c "explain (analyze) select count(t1.b) from (values(1),(2)) t2(x) left join (select * from t1 where a <= $r) t1 on true;" $db | grep "Maximum Storage"
done

for r in 1 2 4 8 16 32 64 128 256 512 1024 2048 4096 8192
do
	echo -n "Testing with $r rows "
	echo "select count(t1.b) from (values(1),(2)) t2(x) left join (select * from t1 where a <= $r) t1 on true;" > bench.sql
	pgbench -n -T 10 -f bench.sql -M prepared $db | grep latency
done
v1-0001-Add-memory-disk-usage-for-Material-in-EXPLAIN-ANA.patch
Description: Binary data
v1-0002-Don-t-use-ExecutorState-memory-context-for-tuples.patch
Description: Binary data