Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each 
> batch might be pretty small on a typical OLTP workload, so it might not help 
> much there, or it might hinder.

True, it is very workload dependent (I was chasing mainly INSERTs multiValues, 
INSERT-SELECT) that often hit the same $block, certainly not OLTP. I would even 
say that INSERT-as-SELECT would be more suited for DWH-like processing.

> But having said that, I don't think any of those possibilities should stop us 
> speeding up smgropen().

Of course! I've tried a couple of much more smaller ideas, but without any big 
gains. I was able to squeeze like 300-400k function calls per second (WAL 
records/s), that was the point I think where I think smgropen() got abused. 

> > Another potential option that we've discussed is that the redo generation
> itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> SELECT on table with indexes, generates interleaved WAL records that touch
> often limited set of blocks that usually put Smgr into spotlight).
> 
> I'm not quite sure if I understand what you mean here.  Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very concentrated on standby side, but on primary side 
one could also change how WAL records are generated:

1) Minimalization of records towards same repeated $block eg. Heap2 
table_multi_insert() API already does this and it matters to generate more 
optimal stream for replay:

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls due to interleave heap with btree records for the same 
block from Smgr perspective (this is especially visible on highly indexed 
tables)  =>      
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 
4/E7000108, prev 4/E70000A0, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 
4/E7000148, prev 4/E7000108, desc: INSERT off 2 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 
4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 
4/E70001C8, prev 4/E7000188, desc: INSERT off 3 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 
4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 
4/E7000248, prev 4/E7000208, desc: INSERT off 4 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 
4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off 4, blkref #0: rel 
1663/16384/32794 blk 1
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 
4/E70002C8, prev 4/E7000288, desc: INSERT off 5 flags 0x00, blkref #0: rel 
1663/16384/32791 blk 0
[..]
Similar stuff happens for UPDATE. It basically prevents recent-buffer 
optimization that avoid repeated calls to smgropen().

And here's already existing table_multi_inserts v2 API (Heap2) sample with 
obvious elimination of unnecessary individual calls to smgopen() via one big 
MULTI_INSERT instead (for CTAS/COPY/REFRESH MV) :
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2       len (rec/tot):    210/   210, tx:   17243290, lsn: 
4/E9000028, prev 4/E8004410, desc: MULTI_INSERT+INIT 10 tuples flags 0x02, 
blkref #0: rel 1663/16384/32801 blk 0
rmgr: Btree       len (rec/tot):    102/   102, tx:   17243290, lsn: 
4/E9000100, prev 4/E9000028, desc: NEWROOT lev 0, blkref #0: rel 
1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32804 blk 1
[..]
Here Btree it is very localized (at least when concurrent sessions are not 
generating WAL) and it enables Thomas's recent-buffer to kick in

DELETE is much more simple (thanks to not chewing out those Btree records) and 
also thanks to Thomas's recent-buffer should theoretically put much less stress 
on smgropen() already:
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 
4/ED000028, prev 4/EC002800, desc: DELETE off 1 flags 0x00 KEYS_UPDATED , 
blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 
4/ED000060, prev 4/ED000028, desc: DELETE off 2 flags 0x00 KEYS_UPDATED , 
blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 
4/ED000098, prev 4/ED000060, desc: DELETE off 3 flags 0x00 KEYS_UPDATED , 
blkref #0: rel 1663/16384/32808 blk 0
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 
4/ED0000D0, prev 4/ED000098, desc: DELETE off 4 flags 0x00 KEYS_UPDATED , 
blkref #0: rel 1663/16384/32808 blk 0
[..]

2) So what's missing - I may be wrong on this one - something like 
"index_multi_inserts" Btree2 API to avoid repeatedly overwhelming smgropen() on 
recovery side for same index's $buffer. Not sure it is worth the effort, though 
especially recent-buffer fixes that:
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off 1, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off 2, blkref #0: rel 
1663/16384/32804 blk 1
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 
4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off 3, blkref #0: rel 
1663/16384/32804 blk 1
right?

3) Concurrent DML sessions mixing WAL records: the buffering on backend's side 
of things (on private "thread" of WAL - in private memory - that would be 
simply "copied" into logwriter's main WAL buffer when committing/buffer full) - 
 it would seem like an very interesting idea to limit interleaving concurrent 
sessions WAL records between each other and exploit the recent-buffer 
enhancement to avoid repeating the same calls to Smgr, wouldn't it? (I'm just 
mentioning it as I saw you were benchmarking it here and called out this idea).

I could be wrong though with many of those simplifications, in any case please 
consult with Thomas as he knows much better and is much more trusted source 
than me 😉

-J.

Reply via email to