[ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

not really specific to Ceph, but since one of the default questions by the
Ceph team when people are facing performance problems seems to be
"Have you tried turning it off and on again?" ^o^ err,
"Are all your interrupts on one CPU?",
I'm going to wax on about this for a bit and hope for some feedback from
others with different experiences and architectures than mine.

Now firstly, that question of whether all your IRQ handling is happening on
the same CPU is a valid one, as depending on a bewildering range of factors,
from kernel parameters to the actual hardware, one often does indeed
wind up with that scenario, usually with all of them on CPU0.
Which certainly is the case with all my recent hardware and Debian
kernels.

I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
thus feedback from Intel users is very much sought after, as I'm
considering Intel-based storage nodes in the future.
It's vaguely amusing that Ceph storage nodes seem to have higher CPU
(individual core performance, not necessarily # of cores) and similar RAM
requirements compared to my VM hosts. ^o^

So the common wisdom is that all IRQs on one CPU is a bad thing, lest it
get overloaded and, for example, drop network packets because of this.
And while that is true, I'm hard pressed to generate any load on my
clusters where the IRQ load on CPU0 goes much beyond 50%.
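A quick way to watch this is mpstat from the sysstat package (just a sketch
of what I look at, namely the per-CPU %irq and %soft columns):
---
# per-CPU utilization, including %irq and %soft, sampled every second
mpstat -P ALL 1
---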

Thus it should come as no surprise that spreading out IRQs with irqbalance,
or more precisely by manually setting the /proc/irq/xx/smp_affinity mask,
doesn't give me any discernible difference when it comes to benchmark
results.
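For reference, doing this by hand looks roughly like the following (a
sketch; IRQ 77 and the masks are made up, check /proc/interrupts for the
real numbers on your system):
---
# see which CPU is fielding which interrupt
cat /proc/interrupts

# current affinity mask of IRQ 77 (hypothetical HBA or NIC vector)
cat /proc/irq/77/smp_affinity

# bitmask of allowed CPUs: 1 = CPU0, 2 = CPU1, 3 = CPU0+CPU1, and so on
echo 2 > /proc/irq/77/smp_affinity
---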

With irqbalance spreading things out willy-nilly, without any regard for or
knowledge of the hardware and which IRQ does what, it's definitely not
something I'll be using out of the box. This goes especially for systems
with multiple NUMA regions and no proper policy scripts for irqbalance.

So for my current hardware I'm going to keep IRQs on CPU0 and CPU1, which
are in the same Bulldozer module and thus share L2 and L3 cache.
In particular, the AHCI (journal SSDs) and HBA or RAID controller IRQs go on
CPU0 and the network (Infiniband) on CPU1.
That should give me sufficient reserves in processing power and keep
cross-module and NUMA (additional physical CPU) traffic to a minimum.
This will also (within a certain load range) allow these 2 CPUs (one module)
to be ramped up to full speed while the other cores can remain at a lower
frequency.
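As a rough illustration, that boils down to something like the following
(the IRQ numbers and driver names here are hypothetical, look up the real
ones in /proc/interrupts first):
---
# find the relevant vectors
grep -E 'ahci|mpt2sas|mlx4' /proc/interrupts

# AHCI and HBA vectors (say IRQs 44 and 45) go to CPU0 (mask 1)
for irq in 44 45; do echo 1 > /proc/irq/$irq/smp_affinity; done

# Infiniband vector (say IRQ 90) goes to CPU1 (mask 2)
echo 2 > /proc/irq/90/smp_affinity
---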

Now with Intel some PCIe lanes are handled by a specific CPU (that's why
you often see the need to add a 2nd CPU to use all slots) and in that
case pinning the IRQ handling for those slots to a specific CPU might
actually make a lot of sense. Especially if not all the traffic generated
by that card will have to be transferred to the other CPU anyway.
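A quick way to check which NUMA node a given card hangs off is sysfs or
hwloc (a sketch; eth0 and the PCI address are placeholders):
---
# NUMA node of a network interface (-1 means no affinity reported)
cat /sys/class/net/eth0/device/numa_node

# or for an arbitrary PCI device
cat /sys/bus/pci/devices/0000:01:00.0/numa_node

# lstopo (from the hwloc package) draws the whole topology, PCI devices included
lstopo
---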


Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

hi christian,

we once were debugging some performance issues, and IRQ balancing was
one of the things we looked into, but no real benefit there for us.
all interrupts on one cpu is only an issue if the hardware itself is not
the bottleneck. we were running some default SAS HBAs (Dell H200), and
those simply can't generate enough load to cause any IRQ issue, even on
older AMD cpus (we did tests on R515 boxes). there was a ceph
presentation somewhere that highlights the impact of using the proper
disk controller; we'll have to fix that first in our case. i'll be
happy if IRQ balancing actually becomes an issue ;)


but another issue is the OSD processes: do you pin those as well? and
how much data do they actually handle? to checksum, the OSD process
needs all the data, so that can also cause a lot of NUMA traffic, esp. if
they are not pinned.


i sort of hope that current CPUs have enough pcie lanes and cores so we 
can use single socket nodes, to avoid at least the NUMA traffic.


stijn




Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

On Mon, 22 Sep 2014 09:35:10 +0200 Stijn De Weirdt wrote:

 hi christian,
 
 we once were debugging some performance isssues, and IRQ balancing was 
 one of the issues we looked in, but no real benefit there for us.
 all interrupts on one cpu is only an issue if the hardware itself is not 
 the bottleneck. 
In particular the spinning rust. ^o^
But this cropped up in recent discussions about all-SSD OSD storage servers,
so there is some (remote) possibility for this to happen.

we were running some default SAS HBA (Dell H200), and 
 those simply can't generated enough load to cause any IRQ issue even on 
 older AMD cpus (we did tests on R515 boxes). (there was a ceph 
 persentation somewhere that highlights the impact of using the proper 
 the disk controller, we'll have to fix that first in our case. i'll be 
 happy if IRQ balancing actually becomes an issue ;)
 
Yeah, this pretty much matches what I'm seeing now and have experienced over
the years.

 but another issue is the OSD processes: do you pin those as well? and 
 how much data do they actually handle. to checksum, the OSD process 
 needs all data, so that can also cause a lot of NUMA traffic, esp if 
 they are not pinned.
 
That's why all my (production) storage nodes have only a single 6 or 8
core CPU. Unfortunately that also limits the amount of RAM in there; 16GB
modules have only recently become an economically viable alternative to
8GB ones.

Thus I don't pin OSD processes, given that on my 8-core nodes with 8 OSDs
and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not
IOwait!) resources with the right (or is that wrong?) tests, namely 4K
FIOs.

The Linux scheduler is usually quite decent at keeping processes where the
action is; thus you see, for example, a clear preference for DRBD or KVM vnet
processes to be on or near the CPU(s) where the IRQs are.
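An easy way to eyeball this (just an illustrative sketch, the process and
driver names are whatever applies to your box):
---
# which CPU each thread last ran on (PSR column)
ps -eLo pid,psr,comm | grep -E 'ceph-osd|drbd|vhost'

# per-CPU interrupt counts for the interesting devices
grep -E 'ahci|mlx4|eth' /proc/interrupts
---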

 i sort of hope that current CPUs have enough pcie lanes and cores so we 
 can use single socket nodes, to avoid at least the NUMA traffic.
 
Even the lackluster Opterons with just PCIe v2 and fewer lanes than current
Intel CPUs are plenty fast enough (sufficient bandwidth) when it comes to
the storage node density I'm deploying.

Christian

Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Florian Haas
On Mon, Sep 22, 2014 at 10:21 AM, Christian Balzer ch...@gol.com wrote:
 The linux scheduler usually is quite decent in keeping processes where the
 action is, thus you see for example a clear preference of DRBD or KVM vnet
 processes to be near or on the CPU(s) where the IRQs are.

Since you're just mentioning it: DRBD, for one, needs to *tell* the
kernel that its sender, receiver and worker threads should be on the
same CPU. It has done that for some time now, but you shouldn't assume
that this is some kernel magic that DRBD can just use. Not suggesting
that you're unaware of this, but the casual reader might be. :)

Cheers,
Florian


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

but another issue is the OSD processes: do you pin those as well? and
how much data do they actually handle. to checksum, the OSD process
needs all data, so that can also cause a lot of NUMA traffic, esp if
they are not pinned.


That's why all my (production) storage nodes have only a single 6 or 8
core CPU. Unfortunately that also limits the amount of RAM in there, 16GB
modules have just recently become an economically viable alternative to
8GB ones.

Thus I don't pin OSD processes, given that on my 8 core nodes with 8 OSDs
and 4 journal SSDs I can make Ceph eat babies and nearly all CPU (not
IOwait!) resources with the right (or is that wrong) tests, namely 4K
FIOs.

The linux scheduler usually is quite decent in keeping processes where the
action is, thus you see for example a clear preference of DRBD or KVM vnet
processes to be near or on the CPU(s) where the IRQs are.
the scheduler has improved recently, but i don't know since which version
(certainly not backported to the RHEL6 kernel).


pinning the OSDs might actually be a bad idea, unless the page cache is
flushed before each osd restart. the kernel VM has this nice feature where
allocating memory in a NUMA domain does not trigger freeing of cache
memory in that domain; it will first try to allocate memory on
another NUMA domain instead. although the VM cache will typically be maxed
out on OSD boxes, i'm not sure the cache clearing itself is NUMA-aware, so
who knows where the memory ends up when it's allocated.
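for what it's worth, per-node memory placement can be checked with something
like this (a sketch; the pid is a placeholder):
---
# free memory and file/cache pages per NUMA node
numastat -m

# NUMA placement of a running OSD's memory
numastat -p <pid of ceph-osd>
---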



stijn


Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Anand Bhat
Page reclamation in Linux is NUMA aware.  So page reclamation is not an issue.

You can see performance improvements only if all the components of a given IO
complete on a single core. This is hard to achieve in Ceph, as a single IO
goes through multiple thread switches and the threads are not bound to any
core. Starting an OSD with numactl and binding it to one core might aggravate
the problem, as all the threads spawned by that OSD will compete for the CPU of
a single core. An OSD with the default configuration has 20+ threads. Binding the
OSD process to one core using taskset does not help either, as some memory
(especially the heap) may already be allocated on the other NUMA node.
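For illustration only, the two approaches mentioned above would look roughly
like this (a sketch; the OSD id, node number, CPU list and pid are
placeholders, and as argued above neither is a clear win):
---
# start an OSD with its CPUs and memory allocations bound to NUMA node 0
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0

# or restrict an already running OSD and all of its threads to CPUs 0-5
taskset -acp 0-5 <pid of ceph-osd>
---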

It looks like the design principle followed is to fan out by spawning multiple
threads at each pipeline stage to utilize the available cores in the system.
Because IOs won't complete on the same core they were issued on, lots of cycles
are lost to cache coherency.

Regards,
Anand





Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Stijn De Weirdt

hi,

Page reclamation in Linux is NUMA aware.  So page reclamation is not
an issue.

except for the first min_free_kbytes? those can come from anywhere, no?
or is the reclamation such that it tries to free an equal portion from each
NUMA domain? if the OSD allocates memory in chunks smaller than that
value, you might be lucky.
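for reference, the knob and the per-node situation can be checked with (a
sketch):
---
# global reserve the kernel tries to keep free
cat /proc/sys/vm/min_free_kbytes

# free memory per NUMA node
numactl --hardware
---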



You can see performance improvements only if all the components of a
given IO completes  on a single core. This is hard to achieve in Ceph
as a single IO goes through multiple thread switches and the threads
are not bound to any core.  Starting an OSD with numactl  and binding
it to one core might aggravate the problem as all the threads spawned
by that OSD will compete for the CPU on a single core.  OSD with
default configuration has 20+ threads .  Binding the OSD process to
one core using taskset does not help as some memory (especially heap)
may be already allocated on the other NUMA node.

this is not true if you start the process under numactl, is it?

but binding an OSD to a NUMA domain makes sense.



Looks the design principle followed is to fan out by spawning
multiple threads at each of the pipelining stage to utilize the
available cores in the system.  Because the IOs won't complete on the
same core as issued, lots of cycles are lost for cache coherency.
is intel HT a solution/help for this? turn on HT and start the OSD on 
the L2 (e.g. with hwloc-bind)


as a more general question, the recommendation for ceph is to have one cpu
core for each OSD; can these be HT cores or do they need to be actual
physical cores?
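something along these lines is what i had in mind with hwloc-bind (a sketch;
the object index and the OSD id are made up and depend on the actual
topology):
---
# inspect the cache/core/PU topology first
lstopo

# run an OSD bound to core 0, i.e. both HT siblings sharing that core's L2
hwloc-bind core:0 -- /usr/bin/ceph-osd -i 0
---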




stijn





Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Mark Nelson

On 09/22/2014 01:55 AM, Christian Balzer wrote:


Hello,

not really specific to Ceph, but since one of the default questions by the
Ceph team when people are facing performance problems seems to be
Have you tried turning it off and on again? ^o^ err,
Are all your interrupts on one CPU?
I'm going to wax on about this for a bit and hope for some feedback from
others with different experiences and architectures than me.


This may be a result of me harping on about this after a customer's
clusters had mysterious performance issues where irqbalance didn't
appear to be working properly. :)




Now firstly that question if all your IRQ handling is happening on the
same CPU is a valid one, as depending on a bewildering range of factors
ranging from kernel parameters to actual hardware one often does indeed
wind up with that scenario, usually with all on CPU0.
Which certainly is the case with all my recent hardware and Debian
kernels.


Yes, there are certainly a lot of scenarios where this can happen.  I 
think the hope has been that with MSI-X, interrupts will get evenly 
distributed by default and that is typically better than throwing them 
all at core 0, but things are still quite complicated.




I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
thus feedback from Intel users is very much sought after, as I'm
considering Intel based storage nodes in the future.
It's vaguely amusing that Ceph storage nodes seem to have more CPU
(individual core performance, not necessarily # of cores) and similar RAM
requirements than my VM hosts. ^o^


It might be reasonable to say that Ceph is a pretty intensive piece of 
software.  With lots of OSDs on a system there are hundreds if not 
thousands of threads.  Under heavy load conditions the CPUs, network 
cards, HBAs, memory, socket interconnects, possibly SAS expanders are 
all getting worked pretty hard and possibly in unusual ways where both 
throughput and latency are important.  At the cluster scale things like 
switch bisection bandwidth and network topology become issues too.  High 
performance clustered storage is imho one of the most complicated 
performance subjects in computing.


The good news is that much of this can be avoided by sticking to simple 
designs with fewer OSDs per node.  The more OSDs you try to stick in 1 
system, the more you need to worry about all of this if you care about 
high performance.




So the common wisdom is that all IRQs on one CPU is a bad thing, lest it
gets overloaded and for example drop network packets because of this.
And while that is true, I'm hard pressed to generate any load on my
clusters where the IRQ ratio on CPU0 goes much beyond 50%.

Thus it should come as no surprise that spreading out IRQs with irqbalance
or more accurately by manually setting the /proc/irq/xx/smp_affinity mask
doesn't give me any discernible differences when it comes to benchmark
results.


Ok, that's fine, but this is pretty subjective.  Without knowing the 
load and the hardware setup I don't think we can really draw any 
conclusions other than that in your test on your hardware this wasn't 
the bottleneck.




With irqbalance spreading things out willy-nilly w/o any regards or
knowledge about the hardware and what IRQ does what it's definitely
something I won't be using out of the box. This goes especially for systems
with different NUMA regions without proper policyscripts for irqbalance.


I believe irqbalance takes PCI topology into account when making mapping 
decisions.  See:


http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html
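For what it's worth, you can ask irqbalance to show its view of the topology
and its placement decisions, and newer versions can defer per-IRQ policy to a
script (a sketch; flag availability depends on the irqbalance version, and
the script path is a placeholder):
---
# one balancing pass with verbose topology/placement output
irqbalance --oneshot --debug

# consult a custom policy script for per-IRQ decisions (newer versions)
irqbalance --policyscript=/etc/irqbalance-policy.sh
---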



So for my current hardware I'm going to keep IRQs on CPU0 and CPU1 which
are the same Bulldozer module and thus sharing L2 and L3 cache.
In particular the AHCI (journal SSDs) and HBA or RAID controller IRQs on
CPU0 and the network (Infiniband) on CPU1.
That should give me sufficient reserves in processing power and keep intra
core (module) and NUMA (additional physical CPUs) traffic to a minimum.
This also will (within a certain load range) allow these 2 CPUs (module)
to be ramped up to full speed while other cores can remain at a lower
frequency.


So it's been a while since I looked at AMD CPU interconnect topology,
but back in the Magny-Cours era I drew up some diagrams:


2 socket:

https://docs.google.com/drawings/d/1_egexLqN14k9bhoN2nkv3iTgAbbPcwuwJmhwWAmakwo/edit?usp=sharing

4 socket:

https://docs.google.com/drawings/d/1V5sFSInKq3uuKRbETx1LVOURyYQF_9Z4zElPrl1YIrw/edit?usp=sharing

I think Interlagos looks somewhat similar from a HyperTransport 
perspective.  My gut instinct is that you really want to keep 
everything you can local to the socket on these kinds of systems.  So if 
your HBA is on the first socket, you want your processing and interrupt 
handling there too.  In the 4-socket configuration this is especially 
true.  It's entirely possible that you may have to go through both an 
on-die and an inter-socket HT link before you get to a neighbour 

Re: [ceph-users] IRQ balancing, distribution

2014-09-22 Thread Christian Balzer

Hello,

On Mon, 22 Sep 2014 08:55:48 -0500 Mark Nelson wrote:

 On 09/22/2014 01:55 AM, Christian Balzer wrote:
 
  Hello,
 
  not really specific to Ceph, but since one of the default questions by
  the Ceph team when people are facing performance problems seems to be
  Have you tried turning it off and on again? ^o^ err,
  Are all your interrupts on one CPU?
  I'm going to wax on about this for a bit and hope for some feedback
  from others with different experiences and architectures than me.
 
 This may be a result of me harping about this after a customer's 
 clusters had mysterious performance issues and where irqbalance didn't 
 appear to be working properly. :)
 
 
  Now firstly that question if all your IRQ handling is happening on the
  same CPU is a valid one, as depending on a bewildering range of factors
  ranging from kernel parameters to actual hardware one often does indeed
  wind up with that scenario, usually with all on CPU0.
  Which certainly is the case with all my recent hardware and Debian
  kernels.
 
 Yes, there are certainly a lot of scenarios where this can happen.  I 
 think the hope has been that with MSI-X, interrupts will get evenly 
 distributed by default and that is typically better than throwing them 
 all at core 0, but things are still quite complicated.
 
 
  I'm using nearly exclusively AMD CPUs (Opteron 42xx, 43xx and 63xx) and
  thus feedback from Intel users is very much sought after, as I'm
  considering Intel based storage nodes in the future.
  It's vaguely amusing that Ceph storage nodes seem to have more CPU
  (individual core performance, not necessarily # of cores) and similar
  RAM requirements than my VM hosts. ^o^
 
 It might be reasonable to say that Ceph is a pretty intensive piece of 
 software.  With lots of OSDs on a system there are hundreds if not 
 thousands of threads.  Under heavy load conditions the CPUs, network 
 cards, HBAs, memory, socket interconnects, possibly SAS expanders are 
 all getting worked pretty hard and possibly in unusual ways where both 
 throughput and latency are important.  At the cluster scale things like 
 switch bisection bandwidth and network topology become issues too.  High 
 performance clustered storage is imho one of the most complicated 
 performance subjects in computing.
 
Nobody will argue that. ^.^

 The good news is that much of this can be avoided by sticking to simple 
 designs with fewer OSDs per node.  The more OSDs you try to stick in 1 
 system, the more you need to worry about all of this if you care about 
 high performance.
 
I'd say that 8 OSDs isn't exactly dense (my case), but the advantages
of less densely populated nodes come with a significant price tag in
rack space and hardware costs.

 
  So the common wisdom is that all IRQs on one CPU is a bad thing, lest
  it gets overloaded and for example drop network packets because of
  this. And while that is true, I'm hard pressed to generate any load on
  my clusters where the IRQ ratio on CPU0 goes much beyond 50%.
 
  Thus it should come as no surprise that spreading out IRQs with
  irqbalance or more accurately by manually setting
  the /proc/irq/xx/smp_affinity mask doesn't give me any discernible
  differences when it comes to benchmark results.
 
 Ok, that's fine, but this is pretty subjective.  Without knowing the 
 load and the hardware setup I don't think we can really draw any 
 conclusions other than that in your test on your hardware this wasn't 
 the bottleneck.
 
Of course I can only realistically talk about what I have tested, and thus I
invited feedback from others.
I can certainly see situations where this could be an issue with Ceph, and I
do have experience with VM hosts that benefited from spreading IRQ
handling over more than one CPU.

What I'm trying to get across is that people should not fall into a cargo-cult
trap but think/examine things for themselves, as blindly turning on
indiscriminate IRQ balancing might do more harm than good in certain
scenarios.

 
  With irqbalance spreading things out willy-nilly w/o any regards or
  knowledge about the hardware and what IRQ does what it's definitely
  something I won't be using out of the box. This goes especially for
  systems with different NUMA regions without proper policyscripts for
  irqbalance.
 
 I believe irqbalance takes PCI topology into account when making mapping 
 decisions.  See:
 
 http://dcs.nac.uci.edu/support/sysadmin/security/archive/msg09707.html
 

I'm sure it tries to do the right thing and it gets at least some things
right, like what my system (single Opteron 4386) looks like:
---
Package 0:  numa_node is 0 cpu mask is 00ff (load 0)
        Cache domain 0:  numa_node is 0 cpu mask is 0003  (load 0)
                CPU number 0  numa_node is 0 (load 0)
                CPU number 1  numa_node is 0 (load 0)
        Cache domain 1:  numa_node is 0 cpu mask is 000c  (load 0)
                CPU number 2  numa_node is 0 (load 0)