On 02/10/2014 17:24, Christian Balzer wrote:
On Thu, 02 Oct 2014 12:20:06 +0200 Massimiliano Cuttini wrote:
On 02/10/2014 03:18, Christian Balzer wrote:
On Wed, 01 Oct 2014 20:12:03 +0200 Massimiliano Cuttini wrote:
Hello Christian,
On 01/10/2014 19:20, Christian Balzer wrote:
Hello,
On Wed, 01 Oct 2014 18:26:53 +0200 Massimiliano Cuttini wrote:
Dear all,
I need a few tips about the best Ceph solution for a drive controller.
I'm getting confused about IT mode, RAID and JBOD.
I have read many posts recommending a JBOD configuration instead of
RAID.
I have 2 storage alternatives right now in my mind:
*SuperStorage Server 2027R-E1CR24L*
which uses SAS3 via an LSI 3008 AOC in IT mode/pass-through
http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24L.cfm
and
*SuperStorage Server 2027R-E1CR24N*
which uses SAS3 via an LSI 3108 SAS3 AOC (in RAID mode?)
http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24N.cfm
Firstly, both of these use an expander backplane.
So if you're planning on putting SSDs in there (even if just like 6
for journals) you may be hampered by that.
The Supermicro homepage is vague as usual and the manual doesn't
actually have a section for that backplane. I guess it will be a
4link connection, so 4x12Gb/s aka 4.8 GB/s.
If the disks all going to be HDDs you're OK, but keep that bit in
mind.
OK, I was thinking of connecting 24 SSDs via SATA3 (6Gbps).
This is why I chose an 8-port SAS3 LSI card that uses a double
PCI 3.0 connection and supports up to 12Gbps.
This should allow me to use the full speed of the SSDs (I guess).
Given the SSD speeds you cite below, SAS2 aka SATA3 would do, too.
And of course be cheaper.
Also what SSDs are you planning to deploy?
I would go with a bulk of cheap consumer SSDs.
I just need them to perform better than HDDs, that's all.
Anything better than that is just fine.
Bad idea.
Read the current "SSD MTBF" thread.
If your cluster is even remotely busy "cheap" consumer SSDs will cost you
more than top end Enterprise ones in a short time (TBW/$).
And they are so unpredictable and likely to fail that a replication of 2
is going to be a very risky proposition, increasing your cost by 1/3rd
anyway if you really care about reliability.
I read the SSD MTBF thread and I don't agree with the point that cheap
SSDs are bad (as I wrote).
The problem is not whether the disk is cheap, but its size.
Writing 50GB of data every day to a 100GB SSD or to a 1TB SSD is
completely different:
the first will last just half a year, the second will last 5 years (with
both being cheap drives).
SSDs have no unpredictable failures; they are not mechanical, they simply
end their life cycle after a predetermined number of writes.
Just take more space and you get more writes.
Buying a 100GB SSD, whether consumer or enterprise, is just silly IMHO.
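The size-versus-endurance argument above can be sketched as a quick
calculation. The P/E cycle count and write-amplification factor below are
illustrative assumptions, not vendor figures:

```python
# Rough SSD lifetime estimate from drive size and daily writes.
# All figures here are illustrative assumptions, not vendor specs.

def ssd_lifetime_years(capacity_gb, pe_cycles, daily_writes_gb, write_amp=1.5):
    """Years until the rated program/erase cycles are exhausted."""
    total_writes_gb = capacity_gb * pe_cycles        # rated endurance (~TBW)
    effective_daily = daily_writes_gb * write_amp    # controller write amplification
    return total_writes_gb / effective_daily / 365

# 50 GB/day on a 100 GB drive vs. a 1 TB drive (assuming ~3000 P/E cycles):
print(ssd_lifetime_years(100, 3000, 50))   # small drive wears out much sooner
print(ssd_lifetime_years(1000, 3000, 50))  # 10x the capacity, 10x the lifetime
```

Whatever the exact endurance rating, lifetime scales linearly with
capacity for a fixed daily write load, which is the point being made here.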
If you can't afford a cluster made entirely of SSDs, a typical HDDs with
SSDs for journal mix is probably going to be fast enough.
Ceph at this point in time can't utilize the potential of a pure SSD
cluster anyway, see the:
"[Single OSD performance on SSD] Can't go over 3,2K IOPS"
thread.
Ok... this is a good point: "why spend a lot if you will not get
performance anyway?"
I definitely have to take this recommendation into account.
I made this analysis:
- Total output: 8x12 = 96Gbps full speed available on the PCI3.0
That's the speed/capacity of the controller.
I'm talking about the actual backplane, where drives plug in.
And that is connected either by one cable (and thus 48Gb/s) or two
(and thus the 96Gb/s you're expecting), the documentation is unclear
on the homepage and not in the manual of that server. Digging around I
found http://www.supermicro.com.tw/manuals/other/BPN-SAS3-216EL.pdf
which suggests two ports, so your basic assumptions are correct.
This is what is written for the backplane: one SAS3 backplane
(BPN-SAS3-216EL1) with 2.5" SAS3 drive slots and 4x mini-SAS3 HD
connectors for SAS3 uplink/downlink.
It supports 4x port mini-SAS3 HD connectors.
This is because some people will buy an AOC LSI card to further speed
up the backplane.
I understood that it supports 1 or 2 expander cards, each one with 4x
mini-SAS3 cables:
2 daughter cards to have failover on the backplane (however this
storage server comes with just 1 port).
So it should be 4x 12Gb/s? I'm getting confused.
No, read that PDF closely.
The single expander card of that server backplane has 2 uplink ports. Each
port usually (and in this case pretty much certainly) has 4 lanes at
12Gb/s each.
Definitely, thank you! I'm not a hardware guru and I couldn't have worked
that out.
Thanks, you heartened me! :)
But verify that with your Supermicro vendor and read up about SAS/SATA
expanders.
If you want/need full speed, the only option with Supermicro seems to
be
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BAC-R920LP.cfm
at this time for SAS3.
That backplane (BPN-SAS3-216A) goes for $300, while the one in that
storage server (BPN-SAS3-216EL1) costs $600.
I think they are both great, however I cannot choose the backplane
for that model.
Build it yourself or have your vendor do a BTO, Build To Order.
As I wrote, I'm not a HW guru... I'm afraid of building a non-working
configuration. I need at least a half-finished solution. I looked through
the storage solutions proposed by Supermicro, and I think they are fine.
Of course a direct connect backplane chassis with SAS2/SATA3 will do
fine as I wrote above, like this one.
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BA-R1K28LP.cfm
In either case get the fastest motherboard/CPUs (Ceph will need those
for SSDs) and the appropriate controller(s). If you're unwilling to
build them yourself, I'm sure some vendor will do BTO. ^^
I cannot change the motherboard (but it seems really good!).
Why?
Not being able to purchase the optimum solution (especially when it is
CHEAPER!) strikes me as odd...
You are right! -_-
But I don't know how to compare two motherboards or what the key
factors to take into account are.
... any suggestions are more than welcome! :)
About CPUs i decided to go for a double E5-2620.
http://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
Not so fast... I went for quantity instead of quality (12 cores will
be enough, no?).
Do you think i need to change it with something better?
If all your OSDs are all going to be SSDs, yes.
Ouch... é_è
...what is the issue with that CPU? ...too slow or too few cores?
http://ark.intel.com/search/advanced?FamilyText=Intel%C2%AE%20Xeon%C2%AE%20Processor%20E5%20v2%20Family
Help me!
RAM is 4x 8GB = 32GB
Barely enough, if you should go for a mixed HDD/SSD setup, add as much
RAM as you can afford, it will speed up things and reads in particular.
Have a look at:
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
This document seems well done.
Thanks, you really are helping me! :)
- Then I should have for each disk a maximum speed of 96Gbps / 24
disks = 4Gbps per disk.
- The disks are SATA3 (6Gbps), so I have a small bottleneck here
that limits me to 4Gbps.
- However, a common SSD never hits the interface speed; they tend to
stay around 450MB/s.
Average speed of an SSD (MB/s):
        Min   Avg   Max
Read    369   485   522
Write   162   428   504
Mixed   223   449   512
Then having a bottleneck at 4Gbps (which means 400MB/s) should be fine
(if I'm not wrong).
Is my reasoning right?
Also expanders introduce some level of overhead, so you're probably
going to wind up with less than 400MB/s per drive.
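The per-drive figure discussed here can be written out as a short
calculation. It uses the common rule of thumb of 10 line-bits per data
byte to account for link encoding; real SAS3 encoding differs slightly,
so treat this as an estimate, not a spec:

```python
# Per-drive bandwidth behind the expander backplane, as discussed above.
# Assumes ~10 line-bits per data byte (8b/10b-style encoding overhead).

def per_drive_mb_s(uplinks, lanes_per_uplink, gbit_per_lane, drives):
    total_gbit = uplinks * lanes_per_uplink * gbit_per_lane  # raw uplink bandwidth
    per_drive_gbit = total_gbit / drives                     # even share per drive
    return per_drive_gbit * 1000 / 10                        # Gb/s -> MB/s at 10 bits/byte

# 2 uplink ports x 4 lanes x 12 Gb/s shared by 24 drives:
print(per_drive_mb_s(2, 4, 12, 24))  # 400.0 MB/s per drive, before expander overhead
```

With only one uplink cabled, the same math gives half that, which is why
the one-cable-or-two question above matters.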
Is 400MB/s per drive good?
I don't think a SAS HDD would even reach that speed.
An HDD won't, but SSDs certainly could, depending on the model. I thought
you were going to deploy all SSDs?
Only SSDs.
I think that the only bottleneck here is the 4x1Gb ethernet
connection.
With a firebreathing storage server like that, you definitely do NOT
want to limit yourself to 1Gb/s links. The latency of these links,
never mind bandwidth will render all your investment in the storage
nodes rather moot.
Even if your clients would not be on something faster, for replication
at least use 10Gb/s Ethernet or my favorite (price and performance
wise), Infiniband.
I have read something about Infiniband but I really don't know much.
If you have some useful links I will take a further look.
Search the ML archives for Infiniband for starters, read up on it on
Wikipedia, compare prices.
But will I need to change all the switches to Infiniband to support it?
Do I need to connect every compute node with Infiniband too?
Is that so, or am I completely wasting my time on useless specs?
It might be a good idea to tell us what your actual plans are.
As in, how many nodes (these are quite dense ones with 24 drives!),
how much storage in total, what kind of use pattern, clients.
Right now we are just testing and experimenting.
We would start with a non-production environment with 2 nodes, learn
Ceph in depth and then replicate tests & findings on 2 more nodes,
upgrade to 10Gb Ethernet and go live.
Given that you're aiming for all SSDs, definitely consider Infiniband
for the backend (replication network) at least.
It's cheaper/faster and also will have more native support (thus even
faster) in upcoming Ceph releases.
Failing that, definitely dedicated client and replication networks,
each with 2x10Gb/s bonded links to get somewhere close to your storage
abilities/bandwidth.
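For reference, the "dedicated client and replication networks" advice
maps to two standard ceph.conf options; the subnets below are
placeholders for illustration:

```ini
[global]
# client (front-side) traffic
public network  = 192.168.1.0/24
# OSD replication / recovery (back-side) traffic
cluster network = 192.168.2.0/24
```

With both set, OSD replication and recovery traffic stays off the
client-facing links.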
I have 3 options:
* add another 4x1Gb card (cheap, but it costs many ports on the switch:
2x4 ports for 1 storage node + 1 for management)
* add a 2x10Gb card (expensive but probably necessary)
* investigate further Infiniband
Again, what is the point of having a super fast storage node all based on
SSDs when your network is slow (latency, thus cutting into your IOPS) and
can't use even 10% of the bandwidth the storage system could deliver?
I know, but I can upgrade the network later... no?
(How can I measure network performance issues while I'm growing?)
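One simple way to keep an eye on inter-node latency as the cluster grows
is to sample TCP connection setup times between nodes. This is only a
rough sketch (dedicated tools like iperf measure bandwidth properly);
the host and port in the example are placeholders:

```python
# Minimal latency probe: average TCP handshake time to another node.
import socket
import time

def tcp_connect_ms(host, port, samples=5):
    """Average TCP connection setup time in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=2):
            pass  # connection established; close immediately
        times.append((time.perf_counter() - start) * 1000)
    return sum(times) / len(times)

# Example: latency to another storage node's SSH port (placeholder address).
# print(tcp_connect_ms("192.168.2.11", 22))
```

Tracking this number over time makes it obvious when the network, rather
than the disks, becomes the limiting factor.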
Next consider the HA aspects of your cluster. Aside from the obvious
like having redundant power feeds and network links/switches, what
happens if a storage node fails?
If you're starting with 2 nodes, that's risky in and by itself (also
deploy at least 3 mons).
If you start with 4 nodes, if one goes down the default behavior of
Ceph would be to redistribute the data on the 3 remaining nodes to
maintain the replication level (a level of 2 is probably acceptable
with the right kind of SSDs).
Now what that means is a LOT of traffic for the replication, potentially
impacting your performance depending on the configuration options and
actual hardware used. It also means your "near full" settings should
be at 70% or lower, because otherwise a node failure could result in
full OSDs and thus a blocked cluster. And of course after the data is
rebalanced the lack of one node means that your cluster is about 25%
slower than before.
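The "near full" reasoning above can be made concrete: when one of n
nodes fails and its data is rebalanced onto the survivors, utilization
grows by a factor of n/(n-1). A quick sketch:

```python
# Cluster fill level after a node failure forces a rebalance.
# Illustrates why a "near full" setting of ~70% or lower is suggested
# for a 4-node cluster.

def utilization_after_failure(used_fraction, nodes, failed=1):
    """Fill level once data is rebalanced onto the surviving nodes."""
    return used_fraction * nodes / (nodes - failed)

print(utilization_after_failure(0.70, 4))  # ~0.93: uncomfortably close to full
print(utilization_after_failure(0.85, 4))  # >1.0: OSDs fill up, cluster blocks
```

On 4 nodes, anything above 75% used before the failure means the
survivors cannot hold all the data.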
These settings are fine by me. I don't expect node failures to be the
norm.
Nobody expects the Spanish inquisition. Or Mr. Murphy.
Being aware of what happens in case it does happen goes a long way.
And with your 2 node test cluster you can't even test this!
I don't want to test the failure workload yet. I need to learn how to
install Ceph.
I cannot buy 8 storage servers without ever having deployed Ceph once.
This test is for starting out; 4 nodes will be for a small production
environment. I plan to double it within a year, but we need to take
small steps.
To me, running 25% slower is nothing compared to not running at all.
If the recovery traffic is too much for your cluster (network, CPUs,
disks), it will be pretty much the same thing.
And if your cluster gets full because it was over 70% capacity when that
node failed, it IS the same thing.
I am not planning to fill up all the bays right away.
I will start with half-filled bays and grow by adding more OSDs
instead of bigger disks.
However, 24 bays give me room to grow just by adding a disk to a bay
from time to time.
The most common and from a pure HA perspective sensible suggestion is
to start with enough nodes that a failure won't have too much impact,
but that of course is also the most expensive option. ^^
Yes! But "expensive" is not an option in this era :)
I need to be effective.
You can only be effective once you know all the components (HW and SW) as
well as the environment (client I/O mostly).
That is the point.
I need to take one step for HW and one step for SW.
This is just the starting point... but we still have a long way to go
to learn the SW.
Let's see.
Max
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com