[ceph-users] IRQ balancing, distribution
Hello,

Not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be "Have you tried turning it off and on again?" ^o^ err, "Are all your interrupts on one CPU?", I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than mine.

Now firstly, that question of whether all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors ranging from kernel parameters to actual hardware one often does indeed wind up with that scenario, usually with everything on CPU0. Which certainly is the case with all my recent hardware and Debian kernels.

I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to need more CPU (individual core performance, not necessarily # of cores) than my VM hosts, and similar amounts of RAM. ^o^

So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and, for example, drops network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%.

Thus it should come as no surprise that spreading out IRQs with irqbalance, or more accurately by manually setting the /proc/irq/xx/smp_affinity mask, doesn't give me any discernible differences when it comes to benchmark results.

With irqbalance spreading things out willy-nilly without any regard for or knowledge of the hardware and which IRQ does what, it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policy scripts for irqbalance.

So for my current hardware I'm going to keep IRQs on CPU0 and CPU1, which are in the same Bulldozer module and thus share L2 and L3 cache. In particular, the AHCI (journal SSDs) and HBA or RAID controller IRQs go on CPU0 and the network (Infiniband) on CPU1. That should give me sufficient reserves in processing power and keep inter-core (module) and NUMA (additional physical CPUs) traffic to a minimum. This also will (within a certain load range) allow these 2 CPUs (one module) to be ramped up to full speed while other cores can remain at a lower frequency.

Now with Intel some PCIe lanes are handled by a specific CPU (that's why you often see the need for adding a 2nd CPU to use all slots) and in that case pinning the IRQ handling for those slots to a specific CPU might actually make a lot of sense. Especially if not all the traffic generated by that card will have to be transferred to the other CPU anyway.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
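A minimal sketch of the manual smp_affinity pinning described above, with hypothetical IRQ numbers (find the real ones for your AHCI, HBA and Infiniband devices in /proc/interrupts); the masks are hex CPU bitmasks, so 1 = CPU0, 2 = CPU1, 3 = both:

    # See which CPU currently services which interrupt
    grep -E 'ahci|mpt|mlx' /proc/interrupts

    # Pin the AHCI/HBA IRQs to CPU0 and the Infiniband IRQ to CPU1
    # (IRQ numbers 44, 45 and 72 are placeholders only)
    echo 1 > /proc/irq/44/smp_affinity   # AHCI -> CPU0
    echo 1 > /proc/irq/45/smp_affinity   # HBA  -> CPU0
    echo 2 > /proc/irq/72/smp_affinity   # IB   -> CPU1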
Re: [ceph-users] IRQ balancing, distribution
hi christian,

We once were debugging some performance issues, and IRQ balancing was one of the things we looked into, but there was no real benefit there for us. All interrupts on one CPU is only an issue if the hardware itself is not the bottleneck. We were running some default SAS HBAs (Dell H200), and those simply can't generate enough load to cause any IRQ issue, even on older AMD CPUs (we did tests on R515 boxes). (There was a Ceph presentation somewhere that highlights the impact of using the proper disk controller; we'll have to fix that first in our case. I'll be happy if IRQ balancing actually becomes an issue ;))

But another issue is the OSD processes: do you pin those as well? And how much data do they actually handle? To checksum, the OSD process needs all the data, so that can also cause a lot of NUMA traffic, especially if they are not pinned.

I sort of hope that current CPUs have enough PCIe lanes and cores so we can use single socket nodes, to avoid at least the NUMA traffic.

stijn
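One quick way to answer the "do you pin those as well?" question on a running box is to look at the current affinity and placement of the OSD processes; a sketch only, assuming the daemons show up as ceph-osd:

    # Allowed-CPU mask of every running OSD daemon
    for pid in $(pgrep -f ceph-osd); do
        taskset -cp "$pid"
    done

    # Which core each OSD thread last ran on (psr column)
    ps -eLo pid,psr,comm | grep ceph-osd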
Re: [ceph-users] IRQ balancing, distribution
Hello,

On Mon, 22 Sep 2014 09:35:10 +0200 Stijn De Weirdt wrote:

> We once were debugging some performance issues, and IRQ balancing was one of the things we looked into, but there was no real benefit there for us. All interrupts on one CPU is only an issue if the hardware itself is not the bottleneck.

In particular the spinning rust. ^o^ But this crept up in recent discussions about all-SSD OSD storage servers, so there is some (remote) possibility for this to happen.

> We were running some default SAS HBAs (Dell H200), and those simply can't generate enough load to cause any IRQ issue, even on older AMD CPUs (we did tests on R515 boxes). (There was a Ceph presentation somewhere that highlights the impact of using the proper disk controller; we'll have to fix that first in our case. I'll be happy if IRQ balancing actually becomes an issue ;))

Yeah, this pretty much matches what I'm seeing and have experienced over the years.

> But another issue is the OSD processes: do you pin those as well? And how much data do they actually handle? To checksum, the OSD process needs all the data, so that can also cause a lot of NUMA traffic, especially if they are not pinned.

That's why all my (production) storage nodes have only a single 6 or 8 core CPU. Unfortunately that also limits the amount of RAM in there; 16GB modules have only recently become an economically viable alternative to 8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not IOwait!) resources with the right (or is that wrong?) tests, namely 4K fio runs. The Linux scheduler usually is quite decent at keeping processes where the action is, thus you see for example a clear preference of DRBD or KVM vnet processes to be near or on the CPU(s) where the IRQs are.

> I sort of hope that current CPUs have enough PCIe lanes and cores so we can use single socket nodes, to avoid at least the NUMA traffic.

Even the lackluster Opterons with just PCIe v2 and fewer lanes than current Intel CPUs are plenty fast enough (sufficient bandwidth) when it comes to the storage node density I'm deploying.

Christian
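For reference, the "4K fio runs" mentioned above are typically something along these lines; a hedged sketch only, with target file, queue depth, job count and runtime as placeholder values rather than the actual job file used in the thread:

    # Small-block random-write load, the kind that turns OSD nodes CPU-bound
    fio --name=4k-randwrite --rw=randwrite --bs=4k --direct=1 \
        --ioengine=libaio --iodepth=32 --numjobs=4 \
        --runtime=60 --time_based --size=1G --filename=/mnt/test/fiofile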
Re: [ceph-users] IRQ balancing, distribution
On Mon, Sep 22, 2014 at 10:21 AM, Christian Balzer <ch...@gol.com> wrote:

> The Linux scheduler usually is quite decent at keeping processes where the action is, thus you see for example a clear preference of DRBD or KVM vnet processes to be near or on the CPU(s) where the IRQs are.

Since you're just mentioning it: DRBD, for one, needs to *tell* the kernel that its sender, receiver and worker threads should be on the same CPU. It has done that for some time now, but you shouldn't assume that this is some kernel magic that DRBD can just use. Not suggesting that you're unaware of this, but the casual reader might be. :)

Cheers,
Florian
Re: [ceph-users] IRQ balancing, distribution
>> But another issue is the OSD processes: do you pin those as well? And how much data do they actually handle? To checksum, the OSD process needs all the data, so that can also cause a lot of NUMA traffic, especially if they are not pinned.
>
> That's why all my (production) storage nodes have only a single 6 or 8 core CPU. Unfortunately that also limits the amount of RAM in there; 16GB modules have only recently become an economically viable alternative to 8GB ones.
> Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not IOwait!) resources with the right (or is that wrong?) tests, namely 4K fio runs. The Linux scheduler usually is quite decent at keeping processes where the action is, thus you see for example a clear preference of DRBD or KVM vnet processes to be near or on the CPU(s) where the IRQs are.

The scheduler has improved recently, but I don't know since what version (certainly not backported to the RHEL6 kernel).

Pinning the OSDs might actually be a bad idea, unless the page cache is flushed before each OSD restart. The kernel VM has this nice feature where allocating memory in a NUMA domain does not trigger freeing of cache memory in that domain; it will first try to allocate memory on another NUMA domain. Although typically the VM cache will be maxed out on OSD boxes, I'm not sure the cache clearing itself is NUMA aware, so who knows where the memory is located when it's allocated.

stijn
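A quick way to observe the effect being described, and to flush the page cache before restarting a pinned OSD; a sketch only, and the per-node numbers will obviously depend on the box:

    # Per-NUMA-node allocation statistics (local hits, misses, foreign allocations)
    numastat

    # Flush pagecache, dentries and inodes before restarting an OSD,
    # so its fresh allocations aren't pushed onto the remote node
    sync
    echo 3 > /proc/sys/vm/drop_caches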
Re: [ceph-users] IRQ balancing, distribution
Page reclamation in Linux is NUMA aware, so page reclamation is not an issue.

You can see performance improvements only if all the components of a given IO complete on a single core. This is hard to achieve in Ceph, as a single IO goes through multiple thread switches and the threads are not bound to any core. Starting an OSD with numactl and binding it to one core might aggravate the problem, as all the threads spawned by that OSD will compete for the CPU of a single core; an OSD with the default configuration has 20+ threads. Binding the OSD process to one core using taskset does not help either, as some memory (especially heap) may already be allocated on the other NUMA node.

It looks like the design principle followed is to fan out by spawning multiple threads at each pipelining stage to utilize the available cores in the system. Because the IOs won't complete on the same core as they were issued on, lots of cycles are lost to cache coherency.

Regards,
Anand
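For concreteness, the two approaches being contrasted above might look roughly like this; purely illustrative, with the OSD id, config path and node number as placeholders rather than anything from the thread, and not an endorsement of either:

    # Bind an OSD (CPU *and* memory) to NUMA node 0 at start time
    numactl --cpunodebind=0 --membind=0 ceph-osd -i 0 -c /etc/ceph/ceph.conf

    # Versus changing the CPU affinity of an already-running OSD,
    # which leaves previously allocated heap wherever it already sits
    taskset -pc 0 $(pgrep -f 'ceph-osd -i 0')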
Re: [ceph-users] IRQ balancing, distribution
hi,

> Page reclamation in Linux is NUMA aware, so page reclamation is not an issue.

Except for the first min_free_kbytes? Those can come from anywhere, no? Or is the reclamation such that it tries to free an equal portion from each NUMA domain? If the OSD allocates memory in chunks smaller than that value, you might be lucky.

> You can see performance improvements only if all the components of a given IO complete on a single core. This is hard to achieve in Ceph, as a single IO goes through multiple thread switches and the threads are not bound to any core. Starting an OSD with numactl and binding it to one core might aggravate the problem, as all the threads spawned by that OSD will compete for the CPU of a single core; an OSD with the default configuration has 20+ threads. Binding the OSD process to one core using taskset does not help either, as some memory (especially heap) may already be allocated on the other NUMA node.

This is not true if you start the process under numactl, is it? But binding an OSD to a NUMA domain makes sense.

> It looks like the design principle followed is to fan out by spawning multiple threads at each pipelining stage to utilize the available cores in the system. Because the IOs won't complete on the same core as they were issued on, lots of cycles are lost to cache coherency.

Is Intel HT a solution/help for this? Turn on HT and start the OSD on the L2 (e.g. with hwloc-bind).

As a more general question: the recommendation for Ceph is to have one CPU core per OSD; can these be HT cores or do they need to be actual physical cores?

stijn
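The hwloc-bind idea mentioned above would look roughly like this; a sketch under the assumption that cores 0 and 1 share an L2 (check the actual topology with hwloc-ls first), with the OSD id and config path as placeholders:

    # Inspect the cache/core topology of the node
    hwloc-ls

    # Start an OSD bound to the two siblings that share an L2
    hwloc-bind core:0-1 -- ceph-osd -i 0 -c /etc/ceph/ceph.conf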
Re: [ceph-users] IRQ balancing, distribution
On 09/22/2014 01:55 AM, Christian Balzer wrote:

> Not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be "Have you tried turning it off and on again?" ^o^ err, "Are all your interrupts on one CPU?", I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than mine.

This may be a result of me harping about this after a customer's clusters had mysterious performance issues and where irqbalance didn't appear to be working properly. :)

> Now firstly, that question of whether all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors ranging from kernel parameters to actual hardware one often does indeed wind up with that scenario, usually with everything on CPU0. Which certainly is the case with all my recent hardware and Debian kernels.

Yes, there are certainly a lot of scenarios where this can happen. I think the hope has been that with MSI-X, interrupts will get evenly distributed by default, and that is typically better than throwing them all at core 0, but things are still quite complicated.

> I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to need more CPU (individual core performance, not necessarily # of cores) than my VM hosts, and similar amounts of RAM. ^o^

It might be reasonable to say that Ceph is a pretty intensive piece of software. With lots of OSDs on a system there are hundreds if not thousands of threads. Under heavy load conditions the CPUs, network cards, HBAs, memory, socket interconnects, and possibly SAS expanders are all getting worked pretty hard, and possibly in unusual ways where both throughput and latency are important. At the cluster scale, things like switch bisection bandwidth and network topology become issues too. High performance clustered storage is imho one of the most complicated performance subjects in computing.

The good news is that much of this can be avoided by sticking to simple designs with fewer OSDs per node. The more OSDs you try to stick in one system, the more you need to worry about all of this if you care about high performance.

> So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and, for example, drops network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%. Thus it should come as no surprise that spreading out IRQs with irqbalance, or more accurately by manually setting the /proc/irq/xx/smp_affinity mask, doesn't give me any discernible differences when it comes to benchmark results.

Ok, that's fine, but this is pretty subjective. Without knowing the load and the hardware setup I don't think we can really draw any conclusions other than that in your test, on your hardware, this wasn't the bottleneck.

> With irqbalance spreading things out willy-nilly without any regard for or knowledge of the hardware and which IRQ does what, it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policy scripts for irqbalance.

I believe irqbalance takes PCI topology into account when making mapping decisions.
See: http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html

> So for my current hardware I'm going to keep IRQs on CPU0 and CPU1, which are in the same Bulldozer module and thus share L2 and L3 cache. In particular, the AHCI (journal SSDs) and HBA or RAID controller IRQs go on CPU0 and the network (Infiniband) on CPU1. That should give me sufficient reserves in processing power and keep inter-core (module) and NUMA (additional physical CPUs) traffic to a minimum. This also will (within a certain load range) allow these 2 CPUs (one module) to be ramped up to full speed while other cores can remain at a lower frequency.

So it's been a while since I looked at AMD CPU interconnect topology, but back in the Magny-Cours era I drew up some diagrams:

2 socket: https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing
4 socket: https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing

I think Interlagos looks somewhat similar from a HyperTransport perspective. My gut instinct is that you really want to keep everything you can local to the socket on these kinds of systems. So if your HBA is on the first socket, you want your processing and interrupt handling there too. In the 4-socket configuration this is especially true. It's entirely possible that you may have to go through both an on-die and an inter-socket HT link before you get to a neighbour
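To act on the "keep everything local to the socket that owns the PCIe slot" advice, the kernel exposes each device's NUMA node in sysfs; a sketch with placeholder interface name and PCI address:

    # Which NUMA node owns the PCIe slot of the Infiniband card?
    cat /sys/class/net/ib0/device/numa_node

    # Same for an arbitrary PCI device, e.g. the HBA's address found via lspci
    cat /sys/bus/pci/devices/0000:01:00.0/numa_node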
Re: [ceph-users] IRQ balancing, distribution
Hello,

On Mon, 22 Sep 2014 08:55:48 -0500 Mark Nelson wrote:

> On 09/22/2014 01:55 AM, Christian Balzer wrote:
>> Not really specific to Ceph, but since one of the default questions by the Ceph team when people are facing performance problems seems to be "Have you tried turning it off and on again?" ^o^ err, "Are all your interrupts on one CPU?", I'm going to wax on about this for a bit and hope for some feedback from others with different experiences and architectures than mine.
>
> This may be a result of me harping about this after a customer's clusters had mysterious performance issues and where irqbalance didn't appear to be working properly. :)
>
>> Now firstly, that question of whether all your IRQ handling is happening on the same CPU is a valid one, as depending on a bewildering range of factors ranging from kernel parameters to actual hardware one often does indeed wind up with that scenario, usually with everything on CPU0. Which certainly is the case with all my recent hardware and Debian kernels.
>
> Yes, there are certainly a lot of scenarios where this can happen. I think the hope has been that with MSI-X, interrupts will get evenly distributed by default, and that is typically better than throwing them all at core 0, but things are still quite complicated.
>
>> I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and thus feedback from Intel users is very much sought after, as I'm considering Intel based storage nodes in the future. It's vaguely amusing that Ceph storage nodes seem to need more CPU (individual core performance, not necessarily # of cores) than my VM hosts, and similar amounts of RAM. ^o^
>
> It might be reasonable to say that Ceph is a pretty intensive piece of software. With lots of OSDs on a system there are hundreds if not thousands of threads. Under heavy load conditions the CPUs, network cards, HBAs, memory, socket interconnects, and possibly SAS expanders are all getting worked pretty hard, and possibly in unusual ways where both throughput and latency are important. At the cluster scale, things like switch bisection bandwidth and network topology become issues too. High performance clustered storage is imho one of the most complicated performance subjects in computing.

Nobody will argue that. ^.^

> The good news is that much of this can be avoided by sticking to simple designs with fewer OSDs per node. The more OSDs you try to stick in one system, the more you need to worry about all of this if you care about high performance.

I'd say that 8 OSDs isn't exactly dense (my case), but the advantages of less densely populated nodes come with the significant price tag of rack space and hardware costs.

>> So the common wisdom is that all IRQs on one CPU is a bad thing, lest it gets overloaded and, for example, drops network packets because of this. And while that is true, I'm hard pressed to generate any load on my clusters where the IRQ ratio on CPU0 goes much beyond 50%. Thus it should come as no surprise that spreading out IRQs with irqbalance, or more accurately by manually setting the /proc/irq/xx/smp_affinity mask, doesn't give me any discernible differences when it comes to benchmark results.
>
> Ok, that's fine, but this is pretty subjective. Without knowing the load and the hardware setup I don't think we can really draw any conclusions other than that in your test, on your hardware, this wasn't the bottleneck.

Of course, I can only realistically talk about what I have tested, and thus invited feedback from others.

I can certainly see situations where this could be an issue with Ceph, and I do have experience with VM hosts that benefited from spreading IRQ handling over more than one CPU. What I'm trying to get across is that people should not fall into a cargo cult trap and should think about and examine things for themselves, as blindly turning on indiscriminate IRQ balancing might do more harm than good in certain scenarios.

>> With irqbalance spreading things out willy-nilly without any regard for or knowledge of the hardware and which IRQ does what, it's definitely something I won't be using out of the box. This goes especially for systems with different NUMA regions without proper policy scripts for irqbalance.
>
> I believe irqbalance takes PCI topology into account when making mapping decisions.
> See: http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html

I'm sure it tries to do the right thing, and it gets at least some things right, like what my system (single Opteron 4386) looks like:
---
Package 0:  numa_node is 0 cpu mask is 00ff (load 0)
        Cache domain 0:  numa_node is 0 cpu mask is 0003 (load 0)
                CPU number 0  numa_node is 0 (load 0)
                CPU number 1  numa_node is 0 (load 0)
        Cache domain 1:  numa_node is 0 cpu mask is 000c (load 0)
                CPU number 2  numa_node is 0 (load 0)