Merlin Moncure <> wrote:
> On Sun, Jun 8, 2014 at 5:45 PM, Kevin Grittner <> wrote:

> Hm, your patch seems to boil down to
>   interleave_memory(start, size, numa_all_nodes_ptr)
> inside PGSharedMemoryCreate().

That's the functional part -- the rest is about not breaking the
builds for environments which are not NUMA-aware.
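A minimal sketch of that functional core plus the build guard; the symbol USE_LIBNUMA and the wrapper name here are my illustrative choices, not necessarily the identifiers the patch uses:

```c
#include <stddef.h>

/*
 * Illustrative sketch only: the libnuma call is compiled in when
 * configure was run with --with-libnuma; otherwise the function is a
 * no-op, so builds for non-NUMA environments are unaffected.
 */
#ifdef USE_LIBNUMA
#include <numa.h>
#endif

static void
interleave_shared_memory(void *start, size_t size)
{
#ifdef USE_LIBNUMA
	if (numa_available() != -1)
		numa_interleave_memory(start, size, numa_all_nodes_ptr);
#else
	(void) start;				/* no-op without NUMA support */
	(void) size;
#endif
}
```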

> I've read your email a couple of times and am a little hazy
> around a couple of points, in particular: "the above is probably
> more important than the attached patch".  So I have a couple of
> questions:
> *) There is a lot of advice floating around (for example here:
>  )
> to instruct operators to disable zone_reclaim.  Will your changes
> invalidate any of that advice?

I expect that it will make the need for that far less acute,
although it is probably still best to disable zone_reclaim (based
on the documented conditions under which disabling it makes sense).
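For reference, checking and disabling zone reclaim on Linux looks roughly like this; vm.zone_reclaim_mode is the standard sysctl, but verify the behavior on your particular kernel:

```shell
# Show the current setting; 0 means zone reclaim is disabled.
cat /proc/sys/vm/zone_reclaim_mode

# Disable it for the running kernel (requires root).
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots.
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```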

> *) is there any downside to enabling --with-libnuma if you have
> support?

Not that I can see.  There are two additional system calls on
postmaster start-up.  I don't expect the time those take to be
measurable.

> Do you expect packagers will enable it generally?

I suspect so.

> Why not just always build it in (if configure allows it) and rely
> on a GUC if there is some kind of tradeoff (and if there is one,
> what kinds of things are you looking for to manage it)?

If a build is done on a machine with the NUMA library, and the
executable is deployed on a machine without it, the postmaster will
get an error on the missing library.  I talked about this briefly
with Tom in Ottawa, and he thought that it would be up to packagers
to create a dependency on the library if they build PostgreSQL
using the --with-libnuma option.  The reason to require the option
is so that a build which won't run on target machines is not
created when a packager does nothing to deal with NUMA.
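A hypothetical packager workflow under that scheme; the path to the backend binary assumes a standard in-tree build:

```shell
# Build with NUMA support explicitly enabled.
./configure --with-libnuma
make

# Confirm the backend now links against libnuma, so the package
# must declare a run-time dependency on it.
ldd src/backend/postgres | grep libnuma
```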

> *) The bash script above, what problem does the 'alternate
> policy' solve?

By default, OS buffers and cache are allocated in the memory node
closest to the process whose read or write first causes them to be
used (the "first touch" policy).  For something like the cp
command, that probably makes sense.  For something like PostgreSQL
it can lead to unbalanced placement of shared resources (like pages
in shared tables and indexes).
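One way such a script can apply an alternate policy is to warm the OS cache under an interleaved policy, so the cached pages spread across nodes instead of all landing on the reading process's node; the paths here are illustrative:

```shell
# Read the data files under an interleaved memory policy: the
# page-cache pages faulted in by this read are spread round-robin
# across all NUMA nodes instead of following first-touch placement.
numactl --interleave=all cat "$PGDATA"/base/*/* > /dev/null
```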

> *) What kinds of improvements (even if in very general terms)
> will we see from better numa management?  Are there further
> optimizations possible?

When I spread both OS cache and PostgreSQL shared memory, I got
about 2% better performance overall for a read-only load on a 4
node system which started with everything on one node.  I used
pgbench and picked a scale which put the database size at about 25%
of machine memory before I initialized the database, so that one
memory node was 100% filled with minimal "spill" to the other
nodes.  The run times between the two cases had very little
overlap.  The balanced memory usage gave more consistent results;
the unbalanced load showed more variable timings, with a rare run
doing better than all of the balanced times.
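The sizing step can be sketched roughly as follows; the ~15 MB per pgbench scale unit is an approximation, and the flags shown are ordinary pgbench options rather than the exact invocation I used:

```shell
# Target a database of ~25% of RAM; one pgbench scale unit is
# roughly 15 MB on disk, so derive the scale from total memory.
mem_mb=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo 2>/dev/null)
mem_mb=${mem_mb:-16384}   # fallback if /proc/meminfo is unavailable
scale=$(( mem_mb / 4 / 15 ))
echo "pgbench scale: $scale"

# Initialize, then run a read-only (SELECT-only) test, if available.
if command -v pgbench >/dev/null; then
    pgbench -i -s "$scale"
    pgbench -S -c 16 -T 300
fi
```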

I didn't spend as much time with read/write benchmarks, but those
seemed overall worse for the unbalanced load, and one outlier on
the bad side was about 20% below the (again, pretty tightly
clustered) times for the balanced load.

These tests were designed to try to create a pretty bad case for
the unbalanced load in a default cpuset configuration and just an
unlucky sizing of the working set relative to a memory node size. 
At PGCon I had a discussion over lunch with someone who saw far
worse performance from unbalanced memory, but he carefully
engineered a really bad case by using one cpuset to force all data
into one node, and then another cpuset to force PostgreSQL to run
only on cores from which access to that node was relatively slow. 
If I remember correctly, he saw about 20% of the throughput that way
versus using the same cores with balanced memory usage.  He
conceded that this was a pretty artificial case, and you would
have to be *trying* to hurt performance to set things up that way,
but he wanted to establish a "worst case" so that he had a hard
bounding of what the maximum possible benefit from balancing load
might be.
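A hedged reconstruction of that kind of setup; the node numbers, and the use of numactl rather than raw cpusets, are my assumptions:

```shell
# Force all memory to node 0 while running only on node 1's cores,
# so every memory access from the server is remote -- a deliberate
# worst case, not a sane configuration.
numactl --membind=0 --cpunodebind=1 postgres -D "$PGDATA"
```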

There is definitely a need for more benchmarks and benchmarks on
more environments, but my preliminary tests all looked favorable to
the combination of this patch and the cpuset changes.  I would have
posted this months ago if I had found enough time to do more
benchmarks and put together a nice presentation of the results, but
I figured it was a good idea to put this in front of people even
with only preliminary results, so that if others were interested in
doing so they could see what results they got in their
environments or with workloads I had not considered.

I will note that given the wide differences I saw between run times
with the unbalanced memory usage, there must be some variable that
matters which I was not properly controlling.  I still haven't
figured out what that was.  It might be something as simple as a
particular process (like the checkpoint or bgwriter process?)
landing on the fully-allocated memory node versus landing somewhere
else.

I will also note that if the buffers and cache are populated by
small OLTP queries running on a variety of cores, the data can be
spread just by happenstance, and in that case this patch should not
be expected to make any difference at all.

Kevin Grittner
The Enterprise PostgreSQL Company
