On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
> Here are the results of a benchmark on Nathan Boley's 64-core, 4
> socket server: 
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

That's interesting. The maximum you see here (~293125 tps) is markedly
lower than what I can get.

... poke around ...

Hm, that's partially because you're using pgbench without -M prepared,
if I see that correctly. The bottleneck in that case is primarily memory
allocation. But even accounting for that I am getting higher
numbers: ~342497.
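For reference, the kind of invocation being compared (the exact client
counts and duration here are my assumption, not quoted from Peter's
setup) would look something like:

```shell
# read-only pgbench; -M prepared skips per-query parse/plan work
# (and the memory allocation that dominates without it)
pgbench -S -M prepared -c 64 -j 64 -T 10 pgbench
```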

Trying to nail down the difference, it oddly seems to come down to your
max_connections=80 vs. my 100. The profiles are markedly different:
way more spinlock contention with 80.

I think =80 has to lead to some data being badly aligned. I can
reproduce that =91 has *much* better performance than =90: 170841.844938
vs. 368490.268577 tps in a 10s test, reproducible both with and without
the patch. That's certainly worth some investigation.
This is *not* reproducible on the Intel machine, so it might be down to
the associativity of the L1/L2 caches on the AMD.

> Perhaps I should have gone past 64 clients, because in the document
> "Lock Scaling Analysis on Intel Xeon Processors" [1], Intel write:
> "This implies that with oversubscription (more threads running than
> available logical CPUs), the performance of spinlocks can depend
> heavily on the exact OS scheduler behavior, and may change drastically
> with operating system or VM updates."
> I haven't bothered with higher client counts though, because Andres
> noted it's the same with 90 clients on this AMD system. Andres: Do you
> see big problems when # clients < # logical cores on the affected
> Intel systems?

There's some slowdown with the patch applied, but it's not big. Without
it, the slowdown starts much earlier.

> There is only a marginal improvement in performance on this big 4
> socket system. Andres informs me privately that he has reproduced the
> problem on multiple new 4-socket Intel servers, so it seems reasonable
> to suppose that it's more or less an Intel thing.

I've just poked around; it's not just 4-socket systems, but also
2-socket ones.

Some background:
The setups that triggered me into working on the patchset didn't really
have a pgbench-like workload; the individual queries were/are more
complicated, even though it's still a high-throughput OLTP workload. And
the contention was *much* higher than what I can reproduce with pgbench
-S: often nearly all the time was spent in the lwlocks' spinlocks, and
it was primarily the buffer mapping lwlocks, locked in shared
mode. The difference is that instead of locking very few buffers per
query like pgbench does, those queries touched many more.
If you look at a profile of a pgbench -S -c 64 -j 64 workload, it's
pretty much all bottlenecked by GetSnapshotData():
-  40.91%  postgres_plainl  postgres_plainlw    [.] s_lock
   - s_lock
      - 51.34% LWLockAcquire
         - GetTransactionSnapshot
            + 55.23% PortalStart
            + 44.77% PostgresMain
      - 48.66% LWLockRelease
         - GetTransactionSnapshot
            + 53.64% PostgresMain
            + 46.36% PortalStart
+   2.65%          pgbench  [kernel.kallsyms]   [k] try_to_wake_up
+   2.61%  postgres_plainl  postgres_plainlw    [.] LWLockRelease
+   1.36%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
+   1.25%  postgres_plainl  postgres_plainlw    [.] GetSnapshotData

-   2.94%       postgres  postgres              [.] LWLockAcquire
   - LWLockAcquire
      + 26.61% ReadBuffer_common
      + 17.08% GetSnapshotData
      + 10.71% _bt_relandgetbuf
      + 10.24% LockAcquireExtended
      + 10.11% VirtualXactLockTableInsert
      + 9.17% VirtualXactLockTableCleanup
      + 7.55% _bt_getbuf
      + 3.40% index_fetch_heap
      + 1.51% LockReleaseAll
      + 0.89% StartTransactionCommand
      + 0.83% _bt_next
      + 0.75% LockRelationOid
      + 0.52% ReadBufferExtended
-   2.75%       postgres  postgres              [.] _bt_compare
   - _bt_compare
      + 96.72% _bt_binsrch
      + 1.71% _bt_moveright
      + 1.29% _bt_search
-   2.67%       postgres  postgres              [.] GetSnapshotData
   - GetSnapshotData
      + 97.03% GetTransactionSnapshot
      + 2.16% PostgresMain
      + 0.81% PortalStart

With the patch applied, the profile looks much saner/less contended,
which immediately shows in transaction rates: 192584.218530 vs.
552834.002573 tps.

But if you try to attack the contention from the other end, by setting
default_transaction_isolation='repeatable read' to reduce the number of
snapshots taken, it's suddenly 536789.807391 vs. 566839.328922, a *much*
smaller benefit.
The reason the patch doesn't help as much with that setting is that
there simply isn't as much actual contention:

+   2.77%       postgres  postgres            [.] _bt_compare
-   2.72%       postgres  postgres            [.] LWLockAcquire
   - LWLockAcquire
      + 30.51% ReadBuffer_common
      + 12.45% VirtualXactLockTableInsert
      + 12.07% _bt_relandgetbuf
      + 10.61% VirtualXactLockTableCleanup
      + 9.95% LockAcquireExtended
      + 8.59% GetSnapshotData
      + 7.51% _bt_getbuf
      + 2.54% index_fetch_heap
      + 1.47% LockReleaseAll
      + 1.16% StartTransactionCommand
      + 0.95% LockRelationOid
      + 0.89% _bt_next
      + 0.77% ReadBufferExtended
-   2.19%       postgres  postgres            [.] s_lock
   - s_lock
      + 52.41% PinBuffer
      + 47.04% UnpinBuffer

While locking is still visible, it's not so extraordinarily dominant. If
you then profile using -e stalled-cycles-backend, you can see that the
actual problematic pipeline stalls are caused by Pin/UnpinBuffer, so
that's now the bottleneck. Since that path doesn't use lwlocks, it's not
surprising that an lwlock patch doesn't bring as much benefit.

> So, I think it's fair to say, given what we now know from Andres'
> numbers and the numbers I got from Nathan's server, that this patch is
> closer to being something that addresses a particularly unfortunate
> pathology on many-socket Intel system than it is to being a general
> performance optimization.

I think one would see bigger improvements if we had a testcase that also
caused real lock contention on the AMDs. If you look at a profile there:

unpatched, -c default_transaction_isolation=read committed
-   3.64%  postgres_plainl  postgres_plainlw    [.] s_lock
   - s_lock
      + 51.49% LWLockAcquire
      + 45.19% LWLockRelease
      + 1.73% PinBuffer
      + 1.59% UnpinBuffer
-   3.46%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
   - LWLockAcquire
      + 36.67% GetSnapshotData
      + 24.61% ReadBuffer_common
      + 11.45% _bt_relandgetbuf
      + 6.78% _bt_getbuf
      + 5.99% LockAcquireExtended
      + 5.27% VirtualXactLockTableInsert
      + 5.00% VirtualXactLockTableCleanup
      + 1.52% _bt_next
      + 1.31% index_fetch_heap
      + 0.76% LockReleaseAll
-   2.90%  postgres_plainl  postgres_plainlw    [.] GetSnapshotData
   - GetSnapshotData
      + 97.40% GetTransactionSnapshot
      + 2.31% PostgresMain

unpatched, -c default_transaction_isolation=repeatable read

-   2.78%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
   - LWLockAcquire
      + 35.41% ReadBuffer_common
      + 15.12% _bt_relandgetbuf
      + 11.78% GetSnapshotData
      + 9.29% _bt_getbuf
      + 7.36% VirtualXactLockTableInsert
      + 7.25% LockAcquireExtended
      + 6.67% VirtualXactLockTableCleanup
      + 1.98% index_fetch_heap
      + 1.88% _bt_next
      + 1.47% LockReleaseAll
      + 0.50% StartTransactionCommand
-   2.21%  postgres_plainl  postgres_plainlw    [.] hash_search_with_hash_value
   - hash_search_with_hash_value
      + 53.23% BufTableLookup
      + 13.92% LockAcquireExtended
      + 8.99% GetPortalByName
      + 6.79% RelationIdGetRelation
      + 4.98% FetchPreparedStatement
      + 4.43% CreatePortal
      + 3.61% PortalDrop
      + 2.46% RemoveLocalLock

There's obviously nowhere near as much contention as in the Intel case,
which shows in the benchmark results:
440910.048200 vs. 473789.980486 tps.

So the lwlock patch cannot improve concurrency to the degree it did on
Intel, since there simply isn't much of a bottleneck. But I'd be
surprised if there aren't workloads with much more prominent
bottlenecks; this just isn't one of them.

The changed lwlock algorithm is imo an *algorithmic* improvement, not
one for a particular architecture. The advantage is that acquiring an
lwlock which is primarily taken in shared mode will never need to wait
or loop.

> Based on the above quoted passage, it isn't
> unreasonable to suppose other vendors or architectures could be
> affected, but that isn't in evidence. While I welcome the use of
> atomic operations in the context of LW_SHARED acquisition as general
> progress, I think that to the outside world my proposed messaging is
> more useful. It's not quite a bug-fix, but if you're using a
> many-socket Intel server, you're *definitely* going to want to use a
> PostgreSQL version that is unaffected. You may well not want to take
> on the burden of waiting for 9.4, or waiting for it to fully
> stabilize.

> I note that Andres has a feature branch of this backported to Postgres
> 9.2, no doubt because of a request from a 2ndQuadrant customer. I have
> to wonder if we should think about making this available with a
> configure switch in one or more back branches.

Yes, that branch is used by some of them. But to make that clear to all
who are still reading: I *first* presented the patch & findings to
-hackers and *then* backported it, and I have referenced the existence
of the 9.2 patch on list before. This isn't some kind of "secret
sauce" deal...

But note that there is one significant difference between the 9.2 and
HEAD versions: the former directly uses gcc intrinsics, without any
form of checks or abstraction across compilers...

> I think that the
> complete back-porting of the fsync request queue issue's fix in commit
> 758728 could be considered a precedent - that too was a fix for a
> really bad worst-case that was encountered fairly infrequently in the
> wild. It's sort of horrifying to have red-hot spinlocks in production,
> so that seems like the kind of thing we should make an effort to
> address for those running multi-socket systems. Of those running
> Postgres on new multi-socket systems, the reality is that the majority
> are running on Intel hardware. Unfortunately, everyone knows that
> Intel will soon be the only game in town when it comes to high-end
> x86_64 servers, which contributes to my feeling that we need to target
> back branches. We should do something about the possible regression
> with older compilers using the fallback first, though.

That might be something to do later, as this *really* can hurt in
practice. We had one server go from a load average of 240 to 11...

But I think we should first focus on getting the patch ready for
master; then we can see where it's going. At the very least I'd like to
split off the part modifying the current spinlocks to use the atomics,
as that seems far too invasive.

> It would be useful to know more about the nature of the problem that
> made such an appreciable difference in Andres' original post.

I unfortunately can't tell you that much more, not because it's private,
but because it was mostly diagnosed by remote-hands debugging, which
limits insight considerably.
What I *do* know is that the bottleneck was entirely caused by the
buffer mapping lwlocks, all taken in shared mode according to call graph
profiles. The call graphs and the queries I have seen indicated that
many of the frequent queries involved nested loops over a not
inconsiderable number of pages.
We also tried increasing the number of buffer mapping locks, but that
didn't prove to be very helpful.

Hm, that went on a bit...


Andres Freund

 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)