Hi Tomas!

On Fri, Jun 27, 2025 at 6:41 PM Tomas Vondra <to...@vondra.me> wrote:
> I agree we should improve the behavior on NUMA systems. But I'm not sure
> this patch goes far enough, or perhaps the approach seems a bit too
> blunt, ignoring some interesting stuff.
>
> AFAICS the patch essentially does the same thing as
>
>     numactl --interleave=all
>
> except that it only does that to shared memory, not to process private
> memory (as if we called numa_set_localalloc). Which means it has some of
> the problems people observe with --interleave=all.
>
> In particular, this practically guarantees that (with 4K memory pages),
> each buffer hits multiple NUMA nodes. Because the first half will
> go to node N, while the second half goes to node (N+1).
>
> That doesn't seem great. It's likely better than a misbalanced system
> with everything allocated on a single NUMA node, but I don't see how it
> could be better than a "balanced" properly warmed up system where the
> buffers are not split like this.
>
> But OK, admittedly this only happens for 4K memory pages, and serious
> systems with a lot of memory are likely to use huge pages, which makes
> this less of an issue (only the buffers crossing the page boundaries
> might get split).
>
>
> My bigger comment however is that the approach focuses on balancing the
> nodes (i.e. ensuring each node gets a fair share of shared memory), and
> is entirely oblivious to the internal structure of the shared memory.
>
> * It interleaves the shared segment, but it has many pieces - shared
> buffers are the largest but not the only one. Does it make sense to
> interleave all the other pieces?
>
> * Some of the pieces are tightly related. For example, we talk about
> shared buffers as if it was one big array, but it actually is two arrays
> - blocks and descriptors. Even if buffers don't get split between nodes
> (thanks to huge pages), there's no guarantee the descriptor for the
> buffer does not end up on a different node.
>
> * In fact, the descriptors are so much smaller than blocks that it's
> practically guaranteed all descriptors will end up on a single node.
>
>
> I could probably come up with a couple more similar items, but I think
> you get the idea. I do think making Postgres NUMA-aware will require
> figuring out how to distribute (or not distribute) different parts of
> the shared memory, and do that explicitly. And do that in a way that
> allows us to do other stuff in a NUMA-aware way, e.g. have separate
> freelists and clocksweep for each NUMA node, etc.

I do understand what you mean, but I'm *NOT* stating here that this makes PG fully "NUMA-aware" - I actually try to avoid claiming that in every sentence. This is only about the imbalance problem specifically. I think we could build those follow-up optimizations as separate patches, in this or in follow-up threads. If we did it all in one giant 0001 (without splitting it up), the very first question would be how to quantify the impact of each of those optimizations (for which we would probably need even more GUCs?). Here I'm just showing that the very first baby step - interleaving - already helps avoid interconnect saturation in some cases. Anyway, even leaving aside the fact that local malloc()s would get interleaved too, adjusting systemd startup scripts to include `numactl --interleave=all` sounds like a dirty hack, not proper UX.
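To make that difference concrete, here is a minimal sketch of roughly what I mean (hand-written for this mail, not the actual patch code; the function name and mapping flags are made up for illustration):

/*
 * Sketch: interleave only the shared memory segment across all allowed
 * NUMA nodes, leaving process-private memory on the local node.
 * Build with -lnuma.
 */
#include <sys/mman.h>
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

static void *
create_interleaved_segment(size_t size)
{
    /* one contiguous anonymous mapping => a single VMA on the kernel side */
    void *seg = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (seg == MAP_FAILED)
    {
        perror("mmap");
        exit(1);
    }

    if (numa_available() >= 0)
    {
        /*
         * numa_interleave_memory() is a thin wrapper around a single
         * mbind(MPOL_INTERLEAVE) call covering the whole range; it does
         * not create a separate mapping per page.  Pages get spread over
         * the nodes in the mask as they are first faulted in.
         */
        numa_interleave_memory(seg, size, numa_all_nodes_ptr);
    }

    /* process-private allocations stay local (cf. numa_set_localalloc()) */
    return seg;
}

Unlike `numactl --interleave=all`, this leaves backend-local malloc()s alone, which is exactly the distinction I care about here.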
Also please note that:

* I do not have a lot of time to dedicate to this, yet I've always been interested in researching it and in why we haven't been able to do it for such a long time - hence the earlier observability work and now $subject (note it is not claiming to be full-blown NUMA awareness, just basic NUMA interleaving as a first [well, second?] step).

* I raised this question in the first post: "How to name this GUC (numa or numa_shm_interleave)?" I still have no idea, but `numa` simply looks better, and we could add way more stuff under it over time (in PG19 or future versions?). Does that sound good?

> That's something numa_interleave_memory simply can't do for us, and I
> suppose it might also have other downsides on large instances. I mean,
> doesn't it have to create a separate mapping for each memory page?
> Wouldn't that be a bit inefficient/costly for big instances?

No? Or what kind of mapping do you have in mind? I think our shared memory on the kernel side is just a single VMA (a contiguous memory region), on which we technically execute a single mbind() (libnuma is just a wrapper around it) - see the sketch earlier in this mail. I have not observed any regressions, quite the opposite actually. I'm also not sure what you mean by "big instances" (AFAIK 1-2TB of shared_buffers might even fail to start).

> Of course, I'm not saying all this as a random passerby - I've been
> working on a similar patch for a while, based on Andres' experimental
> NUMA branch. It's far from complete/perfect, more of a PoC quality, but
> I hope to share it on the mailing list sometime soon.

Cool, I didn't know Andres's branch was public until now - I know he referenced multiple issues in his presentation (and at the hackathon!) - but I wanted to split the work up and try to get something in at least partially, step by step, so that we have at least something. I think we should collaborate (there aren't many people interested in this?), and I can offer my limited help if you attack the more advanced problems. I think we could improve this by over-allocating/spreading/padding certain special regions (e.g. better distributing the ProcArray - but what about cache hits?) to get more juice out of it. Or do you want to start from scratch and re-design/re-think all shmem allocations case by case?

> FWIW while I think the patch doesn't go far enough, there's one area
> where I think it probably goes way too far - configurability. I agree
> it's reasonable to allow running on a subset of nodes, e.g. to split the
> system between multiple instances etc. But do we need to configure that
> from Postgres? Aren't people likely to already use something like
> containers or k8 anyway? I think we should just try to inherit this from
> the environment, i.e. determine which nodes we're allowed to run, and
> use that. Maybe we'll find we need to be smarter, but I think we can
> leave that for later.

That's what "numa=all" is all about (take whatever is there in the OS/namespace); I don't know a better way than, say, relying on numa_get_mems_allowed() being restricted by the namespace/cgroup - a rough sketch of what I mean follows below. I think that if one runs on k8s/containers it's a fairly limited/small deployment that wouldn't benefit from this at all (I struggle to see the point of a k8s pod using 2+ sockets); on the contrary, my experience is that the biggest deployments are usually close to bare metal, where it's also much easier to get consistent results. Anyway, as you say, let's leave it for later.
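To make the "numa=all" idea concrete, a rough sketch (the GUC plumbing is omitted and the helper name numa_nodes_to_use() is made up for illustration):

#include <string.h>
#include <numa.h>

/*
 * Sketch: pick the node set to interleave over.  With "all" we simply
 * inherit whatever the current cpuset/cgroup already allows, so a
 * container restriction is picked up automatically; otherwise parse an
 * explicit list such as "0,2-3".
 */
static struct bitmask *
numa_nodes_to_use(const char *numa_guc)
{
    if (strcmp(numa_guc, "all") == 0)
        return numa_get_mems_allowed();

    return numa_parse_nodestring(numa_guc);
}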
PG is currently often not even CPU-aware (i.e. it does not adjust the sizing of certain structs based on the CPU count), so making it NUMA-aware or cgroup/namespace-aware already sounds like taking 2-3 steps into the future [I think we had at least one discussion about this, around the LWLock partition manager / FP_LOCK_SLOTS_PER_BACKEND, where I proposed sizing certain structures based on $VCPUs - or am I misremembering?].

-J.