Hi, On Tue, Jul 08, 2025 at 10:19:07PM -0400, Andres Freund wrote: > On 2025-06-30 13:36:12 +0000, Bertrand Drouvot wrote: > > * we don't want to allocate memory in some places where the counters are > > incremented (see 4feba03d8b9). We could still use the same implementation > > as for > > backend statistics (i.e, make use of flush_static_cb) if we really > > want/need to > > switch to variable stats. > > Even without allocating memory, this makes wait events considerably more > expensive. Going from ~4 instructions to something more like ~40. Including > external function calls.
Yeah looking at the performance impacts and discuss those was in my todo (as mentioned above), thanks for looking at it! > end wait: > 0x0000000000a08eaa <+90>: mov (%r12),%rax Yeah this one is unbeatable, it will have to be more expensive... On my side, I'm getting: => 0x000000000091a320 <+0>: mov 0x5337f9(%rip),%rax # 0xe4db20 <my_wait_event_info> 0x000000000091a327 <+7>: mov (%rax),%eax 0x000000000091a329 <+9>: mov %eax,%edx 0x000000000091a32b <+11>: and $0xff000000,%edx 0x000000000091a331 <+17>: jne 0x91a340 <waitEventIncrementCounter+32> 0x000000000091a333 <+19>: test %ax,%ax 0x000000000091a336 <+22>: jne 0x91a340 <waitEventIncrementCounter+32> 0x000000000091a338 <+24>: ret 0x000000000091a339 <+25>: nopl 0x0(%rax) 0x000000000091a340 <+32>: cmp $0x1000000,%edx 0x000000000091a346 <+38>: jne 0x91a34e <waitEventIncrementCounter+46> 0x000000000091a348 <+40>: cmp $0x5e,%ax 0x000000000091a34c <+44>: ja 0x91a338 <waitEventIncrementCounter+24> 0x000000000091a34e <+46>: cmp $0x7000000,%edx 0x000000000091a354 <+52>: sete %cl 0x000000000091a357 <+55>: test %ax,%ax 0x000000000091a35a <+58>: setne %al 0x000000000091a35d <+61>: test %al,%cl 0x000000000091a35f <+63>: jne 0x91a338 <waitEventIncrementCounter+24> 0x000000000091a361 <+65>: cmp $0xb000000,%edx 0x000000000091a367 <+71>: je 0x91a338 <waitEventIncrementCounter+24> 0x000000000091a369 <+73>: mov %edi,%eax 0x000000000091a36b <+75>: movzwl %di,%edi 0x000000000091a36e <+78>: shr $0x18,%eax 0x000000000091a371 <+81>: shl $0x5,%rax 0x000000000091a375 <+85>: movslq 0xe4d9c8(%rax),%rax 0x000000000091a37c <+92>: movb $0x1,0x56045d(%rip) # 0xe7a7e0 <have_wait_event_stats> 0x000000000091a383 <+99>: add %rdi,%rax 0x000000000091a386 <+102>: addq $0x1,0xe7a800(,%rax,8) 0x000000000091a38f <+111>: ret > That's about an order of magnitude increase in overhead, just based on the > number of instructions, and probably considerably xmore in runtime (you're > adding external function calls etc). Right. > I think to make this viable, we would have to remove some wait events and be > more careful about adding them. You mean incrementing the counters for wait events for which we expect that the increment is "negligible" as compare to the wait event itself? (Like DATAFILE IO and such)? If so I agree and that's something I also have discussed with a few (also if we add the timings later), that we could/should be more granular. If we go that way we could also imagine a GUC to set the stats for all of them? (default being off). > And optimize this code a good bit. This > really shouldn't end up with this mouch code. Yes, with the idea mentioned in [1] (code not shared yet), I end up with: Dump of assembler code for function waitEventIncrementCounter: => 0x000000000091b320 <+0>: mov %edi,%eax 0x000000000091b322 <+2>: and $0xff000000,%eax 0x000000000091b327 <+7>: jne 0x91b330 <waitEventIncrementCounter+16> 0x000000000091b329 <+9>: test %di,%di 0x000000000091b32c <+12>: jne 0x91b330 <waitEventIncrementCounter+16> 0x000000000091b32e <+14>: ret 0x000000000091b32f <+15>: nop 0x000000000091b330 <+16>: cmp $0x1000000,%eax 0x000000000091b335 <+21>: jne 0x91b33d <waitEventIncrementCounter+29> 0x000000000091b337 <+23>: cmp $0x5e,%di 0x000000000091b33b <+27>: ja 0x91b32e <waitEventIncrementCounter+14> 0x000000000091b33d <+29>: cmp $0x7000000,%eax 0x000000000091b342 <+34>: sete %cl 0x000000000091b345 <+37>: test %di,%di 0x000000000091b348 <+40>: setne %dl 0x000000000091b34b <+43>: test %dl,%cl 0x000000000091b34d <+45>: jne 0x91b32e <waitEventIncrementCounter+14> 0x000000000091b34f <+47>: cmp $0xb000000,%eax 0x000000000091b354 <+52>: je 0x91b32e <waitEventIncrementCounter+14> 0x000000000091b356 <+54>: mov %edi,%eax 0x000000000091b358 <+56>: movzwl %di,%edi 0x000000000091b35b <+59>: shr $0x18,%eax 0x000000000091b35e <+62>: mov 0xe4e9c0(,%rax,8),%rax 0x000000000091b366 <+70>: movb $0x1,0x5603f3(%rip) # 0xe7b760 <have_wait_event_stats> 0x000000000091b36d <+77>: mov 0x8(%rax),%rax 0x000000000091b371 <+81>: addq $0x1,(%rax,%rdi,8) 0x000000000091b376 <+86>: ret That looks better before the first branch (could be done in v1 too) as it uses registers only but looks worse when it's time to increment the counter while the code is much simpler: " WAIT_EVENT_INFO_DECODE(waitInfo, wait_event_info); AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++; " instead of " counter = waitEventGetCounter(PendingWaitEventStats.counts, waitInfo.classId, waitInfo.eventId); (*counter)++; " with the associated machinery in waitEventGetCounter(). Indeed, the waitEventGetCounter() produced only one memory read (and one write): movslq 0xe4d9c8(%rax),%rax addq $0x1,0xe7a800(,%rax,8) while the AllWaitClasses[waitInfo.classId]->pendingCounts[eventId]++ produces 2 pointer dereferences and 1 write: mov 0xe4e9c0(,%rax,8),%rax mov 0x8(%rax),%rax addq $0x1,(%rax,%rdi,8) I still need to work on it and try to make the code simpler and effective. [1]: https://www.postgresql.org/message-id/aG04Zk%2Bb3SRN0pBe%40ip-10-97-1-34.eu-west-3.compute.internal Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com