Hi,

As part of my efforts to make partitioning scale better for larger numbers of partitions, I've been looking primarily at INSERT VALUES performance. Here the overheads are almost completely in the executor; planning of this type of statement is very simple since there is no FROM clause to process.
My benchmarks have been around a RANGE partitioned table with 10k leaf partitions and no sub-partitioned tables. The partition key is a timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed, for a number of reasons. The biggest culprit for the slowness is the locking of each partition inside of find_all_inheritors(). For now, this needs to remain, as we must hold locks on each partition while performing RelationBuildPartitionDesc(); otherwise, one of the partitions may get dropped out from under us. There might be other valid reasons too, but please see my note at the bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo() uses a List, and lappend() must perform a palloc() each time a partition is added to the list.

2. A foreach loop is performed over leaf_parts to search for subplans belonging to each partition. This seems pointless to do for INSERTs, as there are never any to find.

3. ExecCleanupTupleRouting() loops through the entire partitions array. If a single tuple was inserted, then all but one of the elements will be NULL.

4. The tuple conversion map code allocates an empty array on the assumption there might be something to put into it. This is costly when the array is large and pointless when there are no maps to store.

5. During get_partition_dispatch_recurse(), get_rel_relkind() is called to determine if the partition is a partitioned table or a leaf partition. This results in a slow relcache hashtable lookup.

6. get_partition_dispatch_recurse() also ends up just building the indexes array with a sequence of numbers from 0 to nparts - 1 if there are no sub-partitioned tables. Doing this is slow when there are many partitions.

Besides the locking, the only thing that remains slow now is the palloc0() for the 'partitions' array. In my test, it takes 0.6% of execution time. I don't see any pretty way to fix that.
I've written fixes for items 1-6 above. I did:

1. Use an array instead of a List.

2. Don't do this loop. palloc0() the partitions array instead, and let UPDATE add whatever subplans exist to the zeroed array.

3. Track what we initialize in a gapless array and clean up just those entries. Make this array small and grow it only when we need more space.

4. Only allocate the map array when we need to store a map.

5. Work that out in relcache beforehand.

6. Ditto.

The only questionable thing I see is what I did for 6. In partcache.c I'm basically building an array of nparts containing 0 to nparts - 1. It seems a bit pointless, so perhaps there's a better way. I was also a bit too tight to memcpy() that out of relcache, and just pointed directly to it. That might be a no-go area.

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It's just a demo of where we could get the performance if we were smarter about locking partitions. I've included this only to show 0001's worth.
Performance

AWS: m5d.large, fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL,
i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);

CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL,
i2 INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);

\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || ' hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text || ' hours')::interval || ''');' from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:

insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:

insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);

I don't want to discuss the locking on this thread. That discussion would detract from what I'm proposing here, which is not to change anything relating to locks. I'm still working on that and will post about it elsewhere. Please start another thread if you'd like to discuss it in the meantime, and feel free to link it here so others can follow.

-- 
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
Description: Binary data
v1-0002-Unsafe-locking-reduction-for-partitioned-INSERT-U.patch
Description: Binary data