[ceph-users] Re: Numa pinning best practices

Anthony D'Atri Sat, 14 Sep 2024 17:12:50 -0700


> On Fri, Sep 13, 2024 at 12:16 PM Anthony D'Atri <[email protected]> wrote:
>> My sense is that with recent OS and kernel releases (e.g., not CentOS 8) 
>> irqbalance does a halfway decent job.
> 
> Strongly disagree! Canonical has actually disabled it by default in
> Ubuntu 24.04 and IIRC Debian already does, too:
> https://discourse.ubuntu.com/t/ubuntu-24-04-lts-noble-numbat-release-notes/39890#irqbalance-no-more-installed-and-enabled-by-default


Interesting.  The varied viewpoints of the Ceph community are invaluable.

Reading the above page, I infer that recent kernels do well by default now?

> While irqbalance _can_ do a decent job in some scenarios, it can also
> really mess things up. For something like Ceph where you are likely
> running a lot of the same platform(s) and are seeking predictability,
> you can probably do better controlling affinity yourself. At least,
> you should be able to do no worse.

Fair enough, would love to 

> 
>>> I came across a recent Ceph day NYC talk from Tyler Stachecki (Bloomberg) 
>>> [1] and a Reddit post [2]. Apparently there is quite a bit of performance 
>>> to gain when NUMA is optimally configured for Ceph.
>> 
>> My sense is that NUMA is very much a function of what CPUs one is using, and 
>> 1S vs 2S / 4S.  With 4S servers I've seen people using multiple NICs, 
>> multiple HBAs, etc., effectively partitioning into 4x 1S servers.  Why not 
>> save yourself hassle and just use 1S to begin with?  4+S-capable CPUs cost 
>> more and sometimes lag generationally.
> 
> Hey, that's me!

I first saw an elaborate 4S pinning scheme at an OpenStack Summit, 2016 or so.

> As Anthony says, YMMV based on your platform, what you use Ceph for
> (RBD?), and also how much Ceph you're running.
> 
> Early versions of Zen had quite bad core to core memory latency when
> you hopped across CCD/CCX.

There’s a graphic out there comparing those latencies for, IIRC, Ice Lake and 
Rome or Milan.

> There's some early warning signs in the Zen
> 5 client reviews that such latencies may be back to bite (I have not
> gotten my hands on one yet, nor have I seen anyone explain "why" yet):

Ouch.  Would one interpret this as Genoa being better?

> https://www.anandtech.com/show/21524/the-amd-ryzen-9-9950x-and-ryzen-9-9900x-review/3
> 
> In the diagram within that article you can clearly see the ~180ns
> difference, as well as the "striping" effect, when you cross a CCX.
> I'm wondering if this is a byproduct of the new ladder cache design
> within the Zen 5 CCX? Regardless: if you have latencies like this
> within a single socket, you likely stand to gain something by pinning
> processes to NUMA nodes even with 1P servers. The results mentioned in
> my presentation are all based on 1P platforms as well for comparison.

Which presentation?  I want to read through that carefully.  I’m about to 
deploy a bunch of 1S EPYC 9454 systems with 30TB SSDs for RBD, RGW, and perhaps 
later CephFS.  After clamoring for 1S systems for years I finally got my wish; 
now I want to optimize them as best I can, especially with 12x 30TB SSDs each 
(PCI-e Gen 4, QLC and TLC) and bonded 100GE.

In the past I inherited scripting that spread HBA and NIC interrupts across 
physical cores (every other thread) and messed with the CPU governor, but have 
not dived deeply into NVMe interrupts yet.



> 
>>> So what is most optimal there? Does it still make sense to have the Ceph 
>>> processes bound to the CPU where their respective NVMe resides when the 
>>> network interface card is attached to another CPU / NUMA node? Or would 
>>> this just result in more inter NUMA traffic (latency) and negate any 
>>> possible gains that could have been made?
> 
> I never benchmarked this, so I can only guess.
> 
> However: if you look at /proc/interrupts, you will see that most if
> not all enterprise NVMes in Linux effectively get allocated an MSI
> vector per thread per NVMe. Moreover, if you look at
> /proc/irq/<irq>/smp_affinity for each of those MSI vectors, you will
> see that they are each pinned to exactly one CPU thread.
> 
> In my experience, when NUMA pinning OSDs, only the MSI vectors local
> to the NUMA node where the OSD runs really have any activity. That
> seems optimal, so I've never had a reason to look any further.
> 
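
To make the check described above concrete, here is a minimal Python sketch
(not from the thread, and not Tyler's tooling) that parses /proc/interrupts
text for NVMe queue vectors and decodes an smp_affinity hex mask into CPU
numbers. The function names and the "nvme" substring filter are assumptions
for illustration, and the /proc/interrupts layout can differ between kernels,
so treat it as a starting point rather than a finished tool.

```python
def parse_interrupts(text, name_filter="nvme"):
    """Map IRQ number -> per-CPU interrupt counts for matching rows."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    irqs = {}
    for line in lines[1:]:
        fields = line.split()
        # Skip blank lines and named rows such as "NMI:" or "ERR:".
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        counts = fields[1:1 + ncpus]
        description = " ".join(fields[1 + ncpus:])
        if name_filter in description:
            irqs[int(fields[0].rstrip(":"))] = [int(c) for c in counts]
    return irqs


def active_cpus(counts):
    """Return the CPU indexes that have serviced at least one interrupt."""
    return [cpu for cpu, count in enumerate(counts) if count > 0]


def mask_to_cpus(mask):
    """Decode an smp_affinity value (comma-separated 32-bit hex groups)."""
    value = int(mask.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]
```

On a live host one could pass in the contents of /proc/interrupts, read
/proc/irq/<irq>/smp_affinity for each IRQ returned, and compare the CPUs with
nonzero counts against the CPUs of the NUMA node the OSD is pinned to.
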
>>> So the default policy seems to be active, and no Ceph NUMA affinity seems 
>>> to have taken place. Can someone explain to me what Ceph (cephadm) is 
>>> currently doing when the "osd_numa_auto_affinity" config setting is true 
>>> and NUMA is exposed?
> 
> I, personally, am in the camp of folk who are not cephadm fans. What I
> did in my case was to write a shim that sits in front of the
> [email protected] unit, effectively overriding the default
> ExecStart=/usr/bin/ceph-osd.... and replacing it with
> ExecStart=/usr/local/bin/my_numa_shim /usr/bin/ceph-osd...
> 
> The my_numa_shim is a tool which has some a priori knowledge of how the
> platforms are configured, and makes a decision about which NUMA node
> to use for a given OSD after probing which NUMA node is most local to
> the storage device associated with the OSD. It then sets the
> affinity/memory allocation mode of the process and does an execve to
> call /usr/bin/ceph-osd as systemd had originally intended. The pinning
> is not changed by the execve.

Is that tool available?
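
For readers who want a feel for the approach, below is a rough Python sketch
of such a shim. It is emphatically not Tyler's tool: the command-line shape
(block device name first, then the command to exec), the function names, and
the sysfs paths probed for the device's NUMA node are all assumptions. It also
only sets CPU affinity; the Python standard library has no call for the memory
allocation policy, so a real shim would need numactl or libnuma for that part.

```python
import os
import sys


def parse_cpulist(text):
    """Parse a kernel cpulist string such as "0-3,8" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def device_numa_node(block_dev, sysfs="/sys"):
    """Return the NUMA node of a block device, or None if unknown.

    Assumption: the node is exposed under one of these relative paths.
    """
    base = os.path.join(sysfs, "block", block_dev)
    for rel in ("device/numa_node", "device/device/numa_node"):
        try:
            with open(os.path.join(base, rel)) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue
        if node >= 0:  # the kernel reports -1 when there is no NUMA info
            return node
    return None


def node_cpus(node, sysfs="/sys"):
    """Return the set of CPUs that belong to the given NUMA node."""
    path = os.path.join(sysfs, "devices", "system", "node",
                        f"node{node}", "cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())


def main(argv):
    if len(argv) < 3:
        sys.exit(f"usage: {argv[0]} BLOCK_DEV COMMAND [ARGS...]")
    block_dev, command = argv[1], argv[2:]
    node = device_numa_node(block_dev)
    if node is not None:
        cpus = node_cpus(node)
        if cpus:
            os.sched_setaffinity(0, cpus)  # pin this process
    # The affinity mask survives exec, so the command inherits the pinning.
    os.execv(command[0], command)


if __name__ == "__main__" and len(sys.argv) >= 3:
    main(sys.argv)
```

With the default local allocation policy, pinning the CPUs this way means new
allocations will tend to land on the same node, but that is weaker than an
explicit memory binding.
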

> 
> Would something similar work with cephadm? Probably, but offhand I
> have no idea how to implement it.
> 
> Cheers,
> Tyler
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
