On Tue, Dec 22, 2015 at 10:39 AM, Aleksander Alekseev wrote:
> Obviously after splitting a freelist into NUM_LOCK_PARTITIONS
> partitions (and assuming that all necessary locking/unlocking is done
> on calling side) tables can't have more than NUM_LOCK_PARTITIONS
> partitions because it would cause race conditions. For this reason I
> had to define NUM_BUFFER_PARTITIONS as NUM_LOCK_PARTITIONS and compare
> behaviour of PostgreSQL depending on different values of
This is not at all obvious. There is no obvious reason why the number
of ways that the freelist is partitioned needs to have any relation at
all to the number of ways that the hash table itself is partitioned.
Probably, each process should pick a freelist affinity based on
MyBackendId % number_of_partitions or something like that. If it
doesn't find one there, it can search the others, but always starting
with the one for which it has an affinity. That way, the number of
processes contending on a single spinlock will be limited to
max_connections/number_of_partitions except when the table is very
nearly full. That may happen a lot, though, for the buffer mapping
table, so we might need to over-allocate to compensate. We want it to
be the case that a backend will almost always find a free entry on its
own freelist when it needs one, and only occasionally need to visit
any other freelist.
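The affinity scheme described above can be sketched roughly as follows. This is a simplified illustration, not PostgreSQL code: the `FreeListPartition` struct, `freelist_affinity`, and `find_free_partition` names are made up for this example, `NUM_FREELISTS` stands in for whatever partition count is chosen, and the per-partition spinlocks and free-element chains that a real implementation would need are omitted.

```c
#include <stddef.h>

#define NUM_FREELISTS 32

/* Hypothetical per-freelist state; in a real implementation each
 * partition would carry its own spinlock and free-element chain. */
typedef struct FreeListPartition
{
    int nfree;              /* number of free entries in this partition */
} FreeListPartition;

/* Pick the freelist this backend has affinity for.  my_backend_id
 * stands in for MyBackendId; the modulo spreads backends evenly
 * across partitions, so at most about
 * max_connections / NUM_FREELISTS backends share one spinlock. */
static int
freelist_affinity(int my_backend_id)
{
    return my_backend_id % NUM_FREELISTS;
}

/* Look for a partition with a free entry, starting at the affinity
 * partition and wrapping around.  Returns a partition index, or -1
 * if every freelist is empty (table full).  The search of other
 * partitions only happens when the home freelist is exhausted, which
 * should be rare if the table is over-allocated. */
static int
find_free_partition(FreeListPartition *lists, int my_backend_id)
{
    int start = freelist_affinity(my_backend_id);
    int i;

    for (i = 0; i < NUM_FREELISTS; i++)
    {
        int idx = (start + i) % NUM_FREELISTS;

        if (lists[idx].nfree > 0)
            return idx;
    }
    return -1;
}
```

Note that, unlike the "no locks" proposal, a backend here never errors out while any freelist anywhere still has an entry; it just pays an occasional cross-partition search near table-full.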
> Now regarding 60-core server:
> - One spinlock per hash table doesn't scale. I personally was expecting
> - LWLock's and array of spinlocks do scale on NUMA up to a certain
> - Best results are shown by "no locks";
> I believe that "no locks" implementation is the best one since it is at
> least 3 times faster on NUMA than any other implementation. Also it is
> simpler and doesn't have stealing-from-other-freelists logic that
> executes rarely and therefore is a likely source of bugs. Regarding ~16
> elements of freelists which in some corner cases could but wouldn't be
> used --- as I mentioned before I believe it's not such a big problem.
> Also it's a small price to pay for 3 times more TPS.
I doubt it. Having the system become fundamentally less reliable is a
big cost. If the system tries to find a lock table entry and fails,
the user's query is going to error out. They are not at that point
going to say, well, it's good that my system runs fast even though I
can't run pg_dump. They are going to say, running pg_dump used to
work and now it fails. The consequences of failing to make an entry
in the buffer mapping table are at least as bad.
And there's a second point, too, which is that you haven't proven that
accepting the costs of your proposed model is even necessary. You've
tried a few experiments with other approaches, not even all of the
ones that were proposed in the earlier thread, and you don't seem to
have done any investigation of why they didn't perform as well, or if
you did, it's not discussed in this email. So maybe those problems
can be fixed, after all.
> Regarding NUM_LOCK_PARTITIONS (and NUM_BUFFER_PARTITIONS) I have some
> doubts. For sure Robert had a good reason for committing 3acc10c9.
> Unfortunately I'm not familiar with the story behind this commit. What do
> you think?
See the thread "Scaling shared buffer eviction".
The Enterprise PostgreSQL Company