> On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri <[email protected]> wrote:
>> My sense is that with recent OS and kernel releases (e.g., not CentOS 8)
>> irqbalance does a halfway decent job.
>
> Strongly disagree! Canonical has actually disabled it by default in
> Ubuntu 24.04 and IIRC Debian already does, too:
> https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default

Interesting. The varied viewpoints of the Ceph community are invaluable. Reading the above page, I infer that recent kernels do well by default now?

> While irqbalance _can_ do a decent job in some scenarios, it can also
> really mess things up. For something like Ceph where you are likely
> running a lot of the same platform(s) and are seeking predictability,
> you can probably do better controlling affinity yourself. At least,
> you should be able to do no worse.

Fair enough, would love to

>
>>> I came across a recent Ceph day NYC talk from Tyler Stachecki (Bloomberg)
>>> [1] and a Reddit post [2]. Apparently there is quite a bit of performance
>>> to gain when NUMA is optimally configured for Ceph.
>>
>> My sense is that NUMA is very much a function of what CPUs one is using, and
>> 1S vs 2S / 4S. With 4S servers I've seen people using multiple NICs,
>> multiple HBAs, etc., effectively partitioning into 4x 1S servers. Why not
>> save yourself hassle and just use 1S to begin with? 4+S-capable CPUs cost
>> more and sometimes lag generationally.
>
> Hey, that's me!

I first saw an elaborate 4S pinning scheme at an OpenStack Summit, 2016 or so.

> As Anthony says, YMMV based on your platform, what you use Ceph for
> (RBD?), and also how much Ceph you're running.
>
> Early versions of Zen had quite bad core to core memory latency when
> you hopped across CCD/CCX.

There’s a graphic out there comparing those latencies for …. IIRC, Icelake and Rome or Milan.

> There's some early warning signs in the Zen
> 5 client reviews that such latencies may be back to bite (I have not
> gotten my hands on one yet, nor have I seen anyone explain "why" yet):

Ouch. Would one interpret this as Genoa being better?

> https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3
>
> In the diagram within that article you can clearly see the ~180ns
> difference, as well as the "striping" effect, when you cross a CCX.
> I'm wondering if this is a byproduct of the new ladder cache design
> within the Zen 5 CCX? Regardless: if you have latencies like this
> within a single socket, you likely stand to gain something by pinning
> processes to NUMA nodes even with 1P servers. The results mentioned in
> my presentation are all based on 1P platforms as well for comparison.

Which presentation? I want to read through that carefully. I’m about to deploy a bunch of 1S EPYC 9454 systems with 30TB SSDs for RBD, RGW, and perhaps later CephFS. After clamoring for 1S systems for years I finally got my wish; now I want to optimize them as best I can, especially with 12x 30TB SSDs each (PCI-e Gen 4, QLC and TLC). Bonded 100GE. In the past I inherited scripting that spread HBA and NIC interrupts across physical cores (every other thread) and messed with the CPU governor, but I have not dived deeply into NVMe interrupts yet.

>
>>> So what is most optimal there? Does it still make sense to have the Ceph
>>> processes bound to the CPU where their respective NVMe resides when the
>>> network interface card is attached to another CPU / NUMA node? Or would
>>> this just result in more inter NUMA traffic (latency) and negate any
>>> possible gains that could have been made?
>
> I never benchmarked this, so I can only guess.
>
> However: if you look at /proc/interrupts, you will see that most if
> not all enterprise NVMes in Linux effectively get allocated an MSI
> vector per thread per NVMe. Moreover, if you look at
> /proc/irq/<irq>/smp_affinity for each of those MSI vectors, you will see
> that they are each pinned to exactly one CPU thread.
>
> In my experience, when NUMA pinning OSDs, only the MSI vectors local
> to the NUMA node where the OSD runs really have any activity. That
> seems optimal, so I've never had a reason to look any further.
>
>>> So the default policy seems to be active, and no Ceph NUMA affinity seems
>>> to have taken place.
>>> Can someone explain to me what Ceph (cephadm) is
>>> currently doing when the "osd_numa_auto_affinity" config setting is true
>>> and NUMA is exposed?
>
> I, personally, am in the camp of folk who are not cephadm fans. What I
> did in my case was to write a shim that sits in front of the
> [email protected] unit, effectively overriding the default
> ExecStart=/usr/bin/ceph-osd.... and replacing it with
> ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...
>
> The my_numa_shim is a tool which has some a priori knowledge of how the
> platforms are configured, and makes a decision about which NUMA node
> to use for a given OSD after probing which NUMA node is most local to
> the storage device associated with the OSD. It then sets the
> affinity/memory allocation mode of the process and does an execve to
> call /usr/bin/ceph-osd as systemd had originally intended. The pinning
> is not changed by the execve.

Is that tool available?

>
> Would something similar work with cephadm? Probably, but offhand I
> have no idea how to implement it.
>
> Cheers,
> Tyler
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
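
For anyone who wants to experiment with the shim idea Tyler describes above, here is a rough Python sketch of the general shape. It is not Tyler's tool, and everything in it is an assumption on my part: the block device name is passed as a hypothetical first argument (the real shim apparently works out the device from the OSD itself), the NUMA node is read from the standard sysfs `numa_node` and `cpulist` files, and only CPU affinity is set. Binding memory as well would need something like numactl or libnuma, which the Python standard library does not expose; with CPU-only pinning you are relying on the kernel's default local allocation policy.

```python
"""Hypothetical NUMA shim sketch: pin to the NUMA node nearest a device, then exec."""
import os
import sys
from pathlib import Path


def parse_cpulist(text):
    """Expand a sysfs cpulist string such as '0-3,8-11' into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def device_numa_node(block_name, sysfs=Path("/sys")):
    """Walk up from <sysfs>/block/<name> until a numa_node file is found.

    Returns None if no file is found or the kernel reports -1 (no node).
    """
    root = sysfs.resolve()
    path = (root / "block" / block_name).resolve()
    while path != root and path != path.parent:
        node_file = path / "numa_node"
        if node_file.is_file():
            node = int(node_file.read_text())
            return node if node >= 0 else None
        path = path.parent
    return None


def node_cpus(node, sysfs=Path("/sys")):
    """Return the set of CPU ids that belong to the given NUMA node."""
    cpulist = sysfs / "devices" / "system" / "node" / f"node{node}" / "cpulist"
    return parse_cpulist(cpulist.read_text())


def main(argv):
    """Assumed usage: shim <block-device> <command> [args...] (Linux only)."""
    if len(argv) < 3:
        sys.exit("usage: shim <block-device> <command> [args...]")
    block_name, command = argv[1], argv[2:]
    node = device_numa_node(block_name)
    if node is not None:
        # CPU affinity only; it is inherited across the exec below.
        os.sched_setaffinity(0, node_cpus(node))
    # Replace this process with the real command, as systemd intended.
    os.execv(command[0], command)
```

To try it, call `main(sys.argv)` from a `__main__` guard and put the script in front of the real command in ExecStart, along the lines of what Tyler describes. If the device reports no NUMA node, the sketch leaves affinity alone and just execs the command.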
