Hi, Currently, we need to scan the WHOLE shared buffers when VACUUM truncated off any empty pages at end of transaction or when relation is TRUNCATEd. As for our customer case, we periodically truncate thousands of tables, and it's possible to TRUNCATE single table per transaction. This can be problematic later on during recovery which could take longer, especially when a sudden failover happens after those TRUNCATEs and when we have to scan a large-sized shared buffer. In the performance test below, it took almost 12.5 minutes for recovery to complete for 100GB shared buffers. But we want to keep failover very short (within 10 seconds).
Previously, I made an improvement in speeding the truncates of relation forks from 3 scans to one scan.[1] This time, the aim of this patch is to further speedup the invalidation of pages, by linking the cached pages of the target relation in a doubly-linked list and just traversing it instead of scanning the whole shared buffers. In DropRelFileNodeBuffers, we just get the number of target buffers to invalidate for the relation. There is a significant win in this patch, because we were able to complete failover and recover in 3 seconds more or less. I performed similar tests to what I did in the speedup truncates of relations forks.[1][2] However, this time using 100GB shared_buffers. [Machine spec used in testing] Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz CPU: 16, Number of cores per socket: 8 RHEL6.5, Memory: 256GB++ [Test] 1. (Master) Create table (ex. 10,000 tables). Insert data to tables. 2. (Master) DELETE FROM TABLE (ex. all rows of 10,000 tables) (Standby) To test with failover, pause the WAL replay on standby server. (SELECT pg_wal_replay_pause();) 3. (M) psql -c "\timing on" (measures total execution of SQL queries) 4. (M) VACUUM (whole db) 5. (M) Stop primary server. pg_ctl stop -D $PGDATA -w 6. (S) Resume wal replay and promote standby.[2] [Results] A. HEAD (origin/master branch) A1. Vacuum execution on Primary server Time: 730932.408 ms (12:10.932) ~12min 11s A2. Vacuum + Failover (WAL Recovery on Standby) waiting for server to promote........................... .................................... stopped waiting pg_ctl: server did not promote in time 2019/10/25_12:13:09.692─┐ 2019/10/25_12:25:43.576─┘ -->Total: 12min34s B. PATCH B1. Vacuum execution on Primary/Master Time: 6.518333s = 6518.333 ms B2. Vacuum + Failover (WAL Recovery on Standby) 2019/10/25_14:17:21.822 waiting for server to promote...... done server promoted 2019/10/25_14:17:24.827 2019/10/25_14:17:24.833 -->Total: 3.011s [Other Notes] Maybe one disadvantage is that we can have a variable number of relations, and allocated the same number of relation structures as the size of shared buffers. I tried to reduce the use of memory when doing hash table lookup operation by having a fixed size array (100) or threshold of target buffers to invalidate. When doing CachedBufLookup() to scan the count of each buffer in the dlist, I made sure to reduce the number of scans (2x at most). First, we scan the dlist of cached buffers of relations. Then store the target buffers in buf_id_array. Non-target buffers would be removed from dlist but added to temporary dlist. After reaching end of main dlist, we append the temporary dlist to tail of main dlist. I also performed pgbench buffer test, and this patch did not cause overhead to normal DB access performance. Another one that I'd need feedback of is the use of new dlist operations for this cached buffer list. I did not use in this patch the existing Postgres dlist architecture (ilist.h) because I want to save memory space as much as possible especially when NBuffers become large. Both dlist_node & dlist_head are 16 bytes. OTOH, two int pointers for this patch is 8 bytes. Hope to hear your feedback and comments. Thanks in advance, Kirk Jamison [1] https://www.postgresql.org/message-id/flat/D09B13F772D2274BB348A310EE3027C64E2067%40g01jpexmbkw24 [2] https://www.postgresql.org/message-id/D09B13F772D2274BB348A310EE3027C6502672%40g01jpexmbkw24
v1-Optimize-dropping-of-relation-buffers-using-dlist.patch
Description: v1-Optimize-dropping-of-relation-buffers-using-dlist.patch