Hi Tomas!

On Fri, Jun 27, 2025 at 6:41 PM Tomas Vondra <to...@vondra.me> wrote:

> I agree we should improve the behavior on NUMA systems. But I'm not sure
> this patch goes far enough, or perhaps the approach is a bit too
> blunt, ignoring some interesting stuff.
>
> AFAICS the patch essentially does the same thing as
>
>    numactl --interleave=all
>
> except that it only does that to shared memory, not to process private
> memory (as if we called numa_set_localalloc). Which means it has some of
> the problems people observe with --interleave=all.
>
> In particular, this practically guarantees that (with 4K memory pages)
> each buffer hits multiple NUMA nodes, because the first half will go to
> node N while the second half goes to node (N+1).
>
> That doesn't seem great. It's likely better than a misbalanced system
> with everything allocated on a single NUMA node, but I don't see how it
> could be better than "balanced" properly warmed up system where the
> buffers are not split like this.
>
> But OK, admittedly this only happens for 4K memory pages, and serious
> systems with a lot of memory are likely to use huge pages, which makes
> this less of an issue (only the buffers crossing the page boundaries
> might get split).
>
>
> My bigger comment however is that the approach focuses on balancing the
> nodes (i.e. ensuring each node gets a fair share of shared memory), and
> is entirely oblivious to the internal structure of the shared memory.
>
> * It interleaves the shared segment, but it has many pieces - shared
> buffers are the largest but not the only one. Does it make sense to
> interleave all the other pieces?
>
> * Some of the pieces are tightly related. For example, we talk about
> shared buffers as if it was one big array, but it actually is two arrays
> - blocks and descriptors. Even if buffers don't get split between nodes
> (thanks to huge pages), there's no guarantee the descriptor for the
> buffer does not end up on a different node.
>
> * In fact, the descriptors are so much smaller than blocks that it's
> practically guaranteed all descriptors will end up on a single node.
>
>
> I could probably come up with a couple more similar items, but I think
> you get the idea. I do think making Postgres NUMA-aware will require
> figuring out how to distribute (or not distribute) different parts of
> the shared memory, and do that explicitly. And do that in a way that
> allows us to do other stuff in a NUMA-aware way, e.g. have separate
> freelists and clocksweeps for each NUMA node, etc.

I understand what you mean, but I'm *NOT* claiming here that this makes
PG fully "NUMA-aware" - I actually try to avoid saying that in every
sentence. This is only about the imbalance problem specifically. I think
we could build those follow-up optimizations as separate patches in this
or follow-up threads. If we did it all in one giant 0001 (without
splitting it up), the very first question would be how to quantify the
impact of each of those optimizations (for which we would probably need
more GUCs?). Here I'm just showing that the very first baby step -
interleaving - helps avoid interconnect saturation in some cases too.
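
To make the scope concrete, the essence of that first step is roughly
the sketch below. It is a minimal, self-contained illustration rather
than the actual patch code; it assumes libnuma v2, and the mmap()ed
region is just a stand-in for our shared memory segment:

    /* build with: gcc -O2 interleave_demo.c -lnuma */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <numa.h>

    int
    main(void)
    {
        size_t  seg_size = 1UL << 30;   /* stand-in for the shmem segment */
        void   *seg;

        if (numa_available() < 0)
        {
            fprintf(stderr, "no NUMA support, nothing to do\n");
            return 0;
        }

        /* stand-in for the anonymous mapping we create for shared memory */
        seg = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (seg == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* interleave only this region; private memory stays node-local */
        numa_interleave_memory(seg, seg_size, numa_all_nodes_ptr);

        /* pages then get placed round-robin across nodes on first touch */
        return 0;
    }

Unlike `numactl --interleave=all`, the policy is applied to the shared
segment only, not to every allocation the process makes.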

Anyway, even putting aside the fact that local malloc()s would get
interleaved too, adjusting systemd startup scripts to just wrap postgres
in `numactl --interleave=all` sounds like a dirty hack rather than
proper UX.

Also, please note that:
* I do not have a lot of time to dedicate to this, yet I have always
been interested in researching it and wondering why we couldn't do it
for such a long time - hence the previous observability work and now
$subject (which, again, does not claim to be full-blown NUMA awareness,
just basic NUMA interleaving as a first [well, second?] step).
* I raised this question in the first post: "How to name this GUC (numa
or numa_shm_interleave)?" I still have no idea, but `numa` simply looks
better, and we could add way more stuff to it over time (in PG19 or
future versions?). Does that sound good?

> That's something numa_interleave_memory simply can't do for us, and I
> suppose it might also have other downsides on large instances. I mean,
> doesn't it have to create a separate mapping for each memory page?
> Wouldn't that be a bit inefficient/costly for big instances?

No? What kind of mapping do you have in mind? I think our shared memory
on the kernel side is just a single VMA (contiguous memory region), on
which we technically execute mbind() (libnuma is just a wrapper around
it). I have not observed any regressions - actually quite the opposite.
I'm also not sure what you mean by 'big instances' (AFAIK 1-2TB of
shared_buffers might even fail to start).
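
To illustrate what I mean - and this is just a hedged, stand-alone
example with an invented node mask and size, not the patch itself - a
single mbind() call covers the entire contiguous region and the kernel
does the per-page placement lazily at fault time, so no per-page
mappings are needed:

    /* build with: gcc -O2 mbind_demo.c -lnuma */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <numaif.h>

    int
    main(void)
    {
        size_t          len = 64UL * 1024 * 1024;   /* one contiguous VMA */
        unsigned long   nodemask = 0x3;             /* e.g. nodes 0 and 1 */
        void           *p;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        /* one syscall sets the interleave policy for the whole range */
        if (mbind(p, len, MPOL_INTERLEAVE, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0)
        {
            perror("mbind");
            return 1;
        }

        memset(p, 0, len);  /* first touch: pages spread across nodes 0/1 */
        return 0;
    }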

> Of course, I'm not saying all this as a random passerby - I've been
> working on a similar patch for a while, based on Andres' experimental
> NUMA branch. It's far from complete/perfect, more of a PoC quality, but
> I hope to share it on the mailing list sometime soon.

Cool, I didn't know Andres's branch was public until now. I know he
referenced multiple issues in his presentation (and at the hackathon!),
but I wanted to split the work up and try to get something in at least
partially, step by step, so that we have at least something. I think we
should collaborate (there aren't many people interested in this?), and I
can offer my limited help if you attack the more advanced problems. I
think we could get more juice out of this by
over-allocating/spreading/padding certain special regions (e.g. better
distributing the ProcArray - but what about cache hits?), as in the
sketch below - or do you want to start from scratch and
re-design/re-think all shm allocations case by case?
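
Just to be clearer about what I mean by padding/spreading, here is a
purely hypothetical sketch (invented names and sizes, nothing to do with
the real ProcArray layout): each NUMA node gets its own page-aligned,
page-padded slice of a shared array, so a slice could later be bound to
"its" node without any page being split between two nodes:

    #include <stdio.h>

    #define OS_PAGE_SIZE    4096
    #define MAX_NUMA_NODES  8
    #define SLOTS_PER_NODE  256

    /* one node's slice, padded up to a whole number of OS pages */
    typedef struct NodeSlice
    {
        int     slots[SLOTS_PER_NODE];
        char    pad[OS_PAGE_SIZE -
                    (SLOTS_PER_NODE * sizeof(int)) % OS_PAGE_SIZE];
    } NodeSlice;

    int
    main(void)
    {
        /*
         * In real life this would be carved out of the (page-aligned)
         * shared memory segment; alignment is forced here so that every
         * slice starts on its own page.
         */
        static _Alignas(OS_PAGE_SIZE) NodeSlice array[MAX_NUMA_NODES];

        printf("each slice: %zu bytes, whole array: %zu pages\n",
               sizeof(NodeSlice), sizeof(array) / OS_PAGE_SIZE);
        return 0;
    }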

> FWIW while I think the patch doesn't go far enough, there's one area
> where I think it probably goes way too far - configurability. I agree
> it's reasonable to allow running on a subset of nodes, e.g. to split the
> system between multiple instances etc. But do we need to configure that
> from Postgres? Aren't people likely to already use something like
> containers or k8 anyway?
> I think we should just try to inherit this from
> the environment, i.e. determine which nodes we're allowed to run, and
> use that. Maybe we'll find we need to be smarter, but I think we can
> leave that for later.

That's what "numa=all" is all about (take whatever is there in the
OS/namespace), but I do not know a better way than just let's say
numa_get_mems_allowed() being altered somehow by namespace/cgroups. I
think if one runs on k8/containers then it's quite limited/small
deployment and he wouldn't benefit from this at all (I struggle to
imagine the point of k8 pod using 2+ sockets), quite contrary: my
experience indicates that the biggest deployments are usually almost
baremetal? And it's way easier to get consistent results. Anyway as
You say, let's leave it for later. PG currently often is not CPU-aware
(i.e. is not even adjusting sizing of certain structs based on CPU
count), so making it NUMA-aware or cgroup/namespace-aware sounds
already like taking 2-3 steps ahead into future [I think we had
discussion at least one in LWLock partitionmanager /
FP_LOCK_SLOTS_PER_BACKEND where I've proposed to size certain
structures based on $VCPUs or I am misremembering this]
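
If it helps, my rough mental model of "inherit it from the environment"
is the sketch below (illustrative only, not code from the patch):
interleave only across whatever numa_get_mems_allowed() reports for the
current cpuset/cgroup:

    /* build with: gcc -O2 allowed_nodes_demo.c -lnuma */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <numa.h>

    int
    main(void)
    {
        struct bitmask *allowed;
        size_t          len = 1UL << 30;    /* stand-in for shmem */
        void           *seg;

        if (numa_available() < 0)
            return 0;           /* nothing to do on non-NUMA systems */

        /* nodes permitted by the current cpuset/cgroup, not all nodes */
        allowed = numa_get_mems_allowed();

        seg = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (seg == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        numa_interleave_memory(seg, len, allowed);

        printf("interleaving across %u allowed node(s)\n",
               numa_bitmask_weight(allowed));
        return 0;
    }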

-J.

