Hi,

While discussing this patch with Robert off-list, one of the questions he asked was whether there's some size threshold after which it starts to have a negative impact. I didn't have a good answer to that - I did have some intuition (that making it too large would not hurt), but I hadn't done any tests with "extreme" sizes of the fast-path structs.
So I ran some more tests, with up to 4096 "groups" (which means 64k fast-path slots). And no matter how I slice the results, there's no clear regression point beyond which the performance would start to decline (even just slowly). It's the same for all benchmarks, client counts, query modes, and so on.

I'm attaching two PDFs with results for the "join" benchmark I described earlier (query with a join on many partitions) from EPYC 7763 (64/128c). The first one shows "raw" data (throughput = tps), the second one shows throughput relative to the first column (which is pretty much current master, with no optimizations applied). The complete results, including some nice .odp spreadsheets and scripts, are available here:

https://github.com/tvondra/pg-lock-scalability-results

There's often a very clear point where the performance significantly improves - this is usually when all the relation locks start to fit into the fast-path array. With 1000 relations that's ~64 groups, and so on. But there's no point where it would start declining.

My explanation is that the PGPROC (where the fast-path array is) is so large already (close to 1kB) that making it larger does not really cause any additional cache misses, etc. And if it does, that's far outweighed by the cost of accessing (or not having to access) the shared lock table. So I don't think there's any point at which we'd start to regress, at least not because of cache misses, CPU etc. It stops improving, but that's just a sign that we've hit some other bottleneck - that's not a fault of this patch.

But that's not the whole story, of course. Because if there were no issues, why not just make the fast-path array insanely large? In another off-list discussion Andres asked me about the memory this would need, and after looking at the numbers I think that's a strong argument to keep the numbers reasonable.

I did a quick experiment to see the per-connection memory requirements, and how they would be affected by this patch. I simply logged the amount of shared memory computed by CalculateShmemSize(), started the server with 100 and 1000 connections, and did a bit of math to calculate how much memory we need "per connection". For master and different numbers of fast-path groups I got this:

    master       64      1024     32765
    ------------------------------------
     47668    52201    121324   2406892

So by default we need ~48kB / connection, with 64 groups we need ~52kB (which makes sense because that's 1024 x 4B slots), with 1024 groups we get to ~120kB, and with 32k groups ~2.5MB.

I guess those higher values seem a bit insane - we don't want to just increase the per-connection memory requirements 50x for everyone, right? But what about the people who actually want this many locks? Let's bump max_locks_per_transaction from 64 to 1024, and we get this:

    master       64      1024     32765
    ------------------------------------
    419367   423909    493022   2778590

Suddenly, the differences are much smaller, especially for 64 groups, which is roughly the same number of fast-path slots as the max locks per transaction. That shrunk to a ~1% difference. But even for 1024 groups it's now just ~20%, which I think is well worth the benefits. And it's likely memory the system should have available anyway - with 1000 connections that's ~80MB, and if you run with 1000 connections, 80MB should be a rounding error, IMO.

Of course, it does not seem great to force everyone to pay this price, even if they don't need that many locks (and so get no benefit). So how would we improve that?
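FWIW the first table is consistent with a simple model where each fast-path group adds 16 Oid slots plus one 64-bit word of lock-mode bits per backend, i.e. ~72B per group. That layout is just my mental model, not numbers taken from the patch, but the arithmetic is easy to check:

/*
 * Back-of-the-envelope check of the per-connection numbers above,
 * assuming each fast-path group costs 16 Oid slots plus one 64-bit
 * word of lock-mode bits (~72B per group). The layout is my
 * assumption, not something copied from the patch.
 */
#include <stdio.h>

typedef unsigned int Oid;              /* stand-in, keeps the sketch self-contained */
typedef unsigned long long uint64;     /* ditto */

#define FP_SLOTS_PER_GROUP  16
#define BYTES_PER_GROUP     (FP_SLOTS_PER_GROUP * sizeof(Oid) + sizeof(uint64))

int
main(void)
{
    long    base = 47668;              /* measured per-connection size on master */
    int     groups[] = {64, 1024, 32765};

    for (int i = 0; i < 3; i++)
        printf("%5d groups: ~%ld B per connection\n",
               groups[i], base + groups[i] * (long) BYTES_PER_GROUP);

    return 0;
}

That lands within a fraction of a percent of the measured values, so the growth seems to be almost entirely the fast-path array itself.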
I don't think that's possible with a hard-coded size of the array - that allocates the memory for everyone. We'd need to make it variable-length, and while doing those benchmarks I realized we actually already have a GUC for that - max_locks_per_transaction tells us exactly what we need to know, right? I mean, if I know I'll need ~1000 locks, why not make the fast-path array large enough for that?

Of course, the consequence of this would be making PGPROC variable-length, or having it point to memory allocated separately (I prefer the latter option, I think). I haven't done any experiments, but it seems fairly doable - though I'm not sure if it might be more expensive compared to compile-time constants.

At this point I think it's fairly clear we have significant bottlenecks when having to lock many relations - and that won't go away, thanks to partitioning etc. We're already fixing various other bottlenecks for these workloads, which will just increase the pressure on locking. Fundamentally, I think we'll need to either evolve the fast-path system to handle more relations (the limit of 16 was always rather low), or invent some entirely new thing that does something radical (say, locking a "group" of relations instead of locking them one by one). This patch is doing the first thing, and IMHO the increased memory consumption is a sensible / acceptable trade-off. I'm not aware of any proposal for the second approach, and I don't have any concrete idea how it might work.

regards

-- 
Tomas Vondra
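PS: To illustrate the max_locks_per_transaction idea, here's roughly how I imagine the number of groups could be derived at startup. This is only a sketch - the function name, the cap and the power-of-two rounding are made up for illustration, nothing here is taken from the patch:

/*
 * Hypothetical: size the per-backend fast-path array so it can hold
 * roughly max_locks_per_transaction relation locks, rounded up to a
 * power of two (a guess at a convenient shape, so picking a group
 * could stay a cheap bitmask).
 */
#define FP_LOCK_SLOTS_PER_GROUP 16
#define FP_MAX_GROUPS           1024        /* arbitrary sanity cap */

static int
FastPathLockGroups(int max_locks_per_xact)
{
    int     groups = 1;

    while (groups < FP_MAX_GROUPS &&
           groups * FP_LOCK_SLOTS_PER_GROUP < max_locks_per_xact)
        groups *= 2;

    return groups;
}

With the default max_locks_per_transaction = 64 that's 4 groups (64 slots), and with 1024 it's 64 groups - i.e. the column with the ~1% per-connection overhead in the second table. PGPROC itself would then only keep a pointer to the separately allocated arrays, so its size would not change.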
join-epyc-relative.pdf
join-epyc-data.pdf