Hi,

While discussing this patch with Robert off-list, one of the questions he asked was whether there's some size threshold after which it starts to have a negative impact. I didn't have a good answer to that - I did have some intuition (that making it too large would not hurt), but I hadn't done any tests with "extreme" sizes of the fast-path structs.
So I ran some more tests, with up to 4096 "groups" (which means 64k fast-path slots). And no matter how I slice the results, there's no clear regression point beyond which the performance would start to decline (even just slowly). It's the same for all benchmarks, client counts, query modes, and so on.

I'm attaching two PDFs with results for the "join" benchmark I described earlier (query with a join on many partitions) from EPYC 7763 (64/128c). The first one shows "raw" data (throughput = tps), the second one shows throughput relative to the first column (which is pretty much current master, with no optimizations applied). The complete results, including some nice .odp spreadsheets and scripts, are available here:

https://github.com/tvondra/pg-lock-scalability-results

There's often a very clear point where the performance significantly improves - this is usually when all the relation locks start to fit into the fast-path array. With 1000 relations that's ~64 groups, and so on. But there's no point where it would start declining.

My explanation is that the PGPROC (where the fast-path array is) is so large already (close to 1kB) that making it larger does not really cause any additional cache misses, etc. And if it does, that's far outweighed by the cost of accessing (or not having to access) the shared lock table. So I don't think there's any point at which we'd start to regress, at least not because of cache misses, CPU etc. It stops improving, but that's just a sign that we've hit some other bottleneck - that's not a fault of this patch.

But that's not the whole story, of course. Because if there were no issues, why not just make the fast-path array insanely large? In another off-list discussion Andres asked me about the memory this would need, and after looking at the numbers I think that's a strong argument to keep the numbers reasonable.

I did a quick experiment to see the per-connection memory requirements, and how they would be affected by this patch. I simply logged the amount of shared memory computed by CalculateShmemSize(), started the server with 100 and 1000 connections, and did a bit of math to calculate how much memory we need "per connection". For master and different numbers of fast-path groups I got this:

    master       64      1024     32765
    ------------------------------------
     47668    52201    121324   2406892

So by default we need ~48kB / connection, with 64 groups we need ~52kB (which makes sense because that's 1024 x 4B slots), with 1024 groups we get to ~120kB, and with 32k groups ~2.5MB.

I guess those higher values seem a bit insane - we don't want to just increase the per-connection memory requirements 50x for everyone, right? But what about the people who actually want this many locks? Let's bump max_locks_per_transaction from 64 to 1024, and we get this:

    master       64      1024     32765
    ------------------------------------
    419367   423909    493022   2778590

Suddenly, the differences are much smaller, especially for 64 groups, which is roughly the same number of fast-path slots as the max locks per transaction. That shrunk to a ~1% difference. But even for 1024 groups it's now just ~20%, which I think is well worth the benefits. And it's likely memory the system should have available anyway - with 1000 connections that's ~80MB, and if you run with 1000 connections, 80MB should be a rounding error, IMO.

Of course, it does not seem great to force everyone to pay this price, even if they don't need that many locks (and so get no benefit). So how would we improve that?
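FWIW the first table is consistent with a simple model where each fast-path group adds 16 Oid slots plus one 64-bit word of lock-mode bits per backend, i.e. ~72B per group. That layout is just my mental model, not numbers taken from the patch, but the arithmetic is easy to check:

/*
 * Back-of-the-envelope check of the per-connection numbers above,
 * assuming each fast-path group costs 16 Oid slots plus one 64-bit
 * word of lock-mode bits (~72B per group). The layout is my
 * assumption, not something copied from the patch.
 */
#include <stdio.h>

typedef unsigned int Oid;              /* stand-in, keeps the sketch self-contained */
typedef unsigned long long uint64;     /* ditto */

#define FP_SLOTS_PER_GROUP  16
#define BYTES_PER_GROUP     (FP_SLOTS_PER_GROUP * sizeof(Oid) + sizeof(uint64))

int
main(void)
{
    long    base = 47668;              /* measured per-connection size on master */
    int     groups[] = {64, 1024, 32765};

    for (int i = 0; i < 3; i++)
        printf("%5d groups: ~%ld B per connection\n",
               groups[i], base + groups[i] * (long) BYTES_PER_GROUP);

    return 0;
}

That lands within a fraction of a percent of the measured values, so the growth seems to be almost entirely the fast-path array itself.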
I don't think that's possible with a hard-coded size of the array - that allocates the memory for everyone. We'd need to make it variable-length, and while doing those benchmarks I realized we actually already have a GUC for that - max_locks_per_transaction tells us exactly what we need to know, right? I mean, if I know I'll need ~1000 locks, why not make the fast-path array large enough for that?

Of course, the consequence of this would be making PGPROC variable-length, or having it point to memory allocated separately (I prefer the latter option, I think). I haven't done any experiments, but it seems fairly doable - though I'm not sure if it might be more expensive compared to compile-time constants.

At this point I think it's fairly clear we have significant bottlenecks when having to lock many relations - and that won't go away, thanks to partitioning etc. We're already fixing various other bottlenecks for these workloads, which will just increase the pressure on locking. Fundamentally, I think we'll need to either evolve the fast-path system to handle more relations (the limit of 16 was always rather low), or invent some entirely new thing that does something radical (say, locking a "group" of relations instead of locking them one by one). This patch is doing the first thing, and IMHO the increased memory consumption is a sensible / acceptable trade-off. I'm not aware of any proposal for the second approach, and I don't have any concrete idea how it might work.

regards

-- 
Tomas Vondra
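PS: To illustrate the max_locks_per_transaction idea, here's roughly how I imagine the number of groups could be derived at startup. This is only a sketch - the function name, the cap and the power-of-two rounding are made up for illustration, nothing here is taken from the patch:

/*
 * Hypothetical: size the per-backend fast-path array so it can hold
 * roughly max_locks_per_transaction relation locks, rounded up to a
 * power of two (a guess at a convenient shape, so picking a group
 * could stay a cheap bitmask).
 */
#define FP_LOCK_SLOTS_PER_GROUP 16
#define FP_MAX_GROUPS           1024        /* arbitrary sanity cap */

static int
FastPathLockGroups(int max_locks_per_xact)
{
    int     groups = 1;

    while (groups < FP_MAX_GROUPS &&
           groups * FP_LOCK_SLOTS_PER_GROUP < max_locks_per_xact)
        groups *= 2;

    return groups;
}

With the default max_locks_per_transaction = 64 that's 4 groups (64 slots), and with 1024 it's 64 groups - i.e. the column with the ~1% per-connection overhead in the second table. PGPROC itself would then only keep a pointer to the separately allocated arrays, so its size would not change.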
join-epyc-relative.pdf
join-epyc-data.pdf