As a follow-on from [1] and also discussed in [2], I'd like to propose that we don't obtain locks on all partitions during INSERT into a partitioned table and UPDATE of a partitioned key and instead, only lock the partition when we first route a tuple to it. This means that the order that the locks are obtained is no longer well defined and is at the mercy of the order that tuples are INSERTed or UPDATEd. It seems worth relaxing this a bit for gains in performance, as when a partitioned table contains many partitions, the overhead of locking all partitions when inserting a single row, or just a few rows is often significantly higher than the cost of doing the actual insert.
The current behaviour was added in 54cde0c4c058073 in order to minimise deadlock risk. It seems that the risk there only comes from AELs that could be taken when a partition directly receives a TRUNCATE / CREATE INDEX / VACUUM FULL / CLUSTER. There's obviously no conflict with other DML operations since two RowExclusiveLocks don't conflict with each other. I think all other AEL obtaining DDL must be performed on the top level partitioned table, for example, ADD COLUMN can't be done directly on a partition, so there's no added deadlock risk from those. For a deadlock to occur one of the above DDL commands would have to be executed inside a transaction in an order opposite to the order rows are being INSERTed or UPDATEd in the partitioned table. If required, such operations could LOCK TABLE the top partitioned table to block the DML operation. There's already a risk of similar deadlocks from such operations done on multiple separate tables when the order they're done is not the same as the order the tables are written in a query, although, in that case, the window for the deadlock is likely to be much smaller. With this done, the performance of an INSERT into a 10k partition partitioned table looks like: Setup: create table hashp (a int) partition by hash(a); select 'create table hashp'||x::Text || ' partition of hashp for values with (modulus 10000, remainder '||x::text||');' from generate_Series(0,9999) x; \gexec hashp_insert.sql: \set p_a random(1,1000) insert into hashp values(:p_a); Results: $ psql -c "truncate hashp;" postgres && pgbench -n -f hashp_insert.sql -M prepared -c 4 -j 4 -T 60 postgres Patched: tps = 27811.427620 (excluding connections establishing) tps = 28617.417308 (excluding connections establishing) Unpatched: tps = 130.446706 (excluding connections establishing) tps = 119.726641 (excluding connections establishing) The patch is attached. I'll park this here until the January commitfest. [1] https://www.postgresql.org/message-id/flat/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=gvbwvgh4a...@mail.gmail.com [2] https://www.postgresql.org/message-id/flat/25C1C6B2E7BE044889E4FE8643A58BA963B5796B%40G01JPEXMBKW03#3783596a794c6789a54a95a20971b6aa -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
v1-0001-Delay-locking-of-partitions-during-INSERT-and-UPD.patch
Description: Binary data