On Thu, 02 Oct 2014 12:20:06 +0200 Massimiliano Cuttini wrote:

> 
> On 02/10/2014 03:18, Christian Balzer wrote:
> > On Wed, 01 Oct 2014 20:12:03 +0200 Massimiliano Cuttini wrote:
> >
> >> Hello Christian,
> >>
> >>
> >> On 01/10/2014 19:20, Christian Balzer wrote:
> >>> Hello,
> >>>
> >>> On Wed, 01 Oct 2014 18:26:53 +0200 Massimiliano Cuttini wrote:
> >>>
> >>>> Dear all,
> >>>>
> >>>> I need a few tips about the best Ceph solution for a drive
> >>>> controller. I'm getting confused about IT mode, RAID and JBOD.
> >>>> I read many posts saying not to go for RAID but to use a JBOD
> >>>> configuration instead.
> >>>>
> >>>> I have 2 storage alternatives right now in my mind:
> >>>>
> >>>>       *SuperStorage Server 2027R-E1CR24L*
> >>>>       which uses SAS3 via an LSI 3008 AOC in IT mode/pass-through
> >>>>       
> >>>> http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24L.cfm
> >>>>
> >>>> and
> >>>>
> >>>>       *SuperStorage Server 2027R-E1CR24N*
> >>>>       which uses SAS3 via an LSI 3108 AOC (in RAID mode?)
> >>>>       
> >>>> http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24N.cfm
> >>>>
> >>> Firstly, both of these use an expander backplane.
> >>> So if you're planning on putting SSDs in there (even if just like 6
> >>> for journals) you may be hampered by that.
> >>> The Supermicro homepage is vague as usual and the manual doesn't
> >>> actually have a section for that backplane. I guess it will be a
> >>> 4-link connection, so 4x 12Gb/s aka 4.8 GB/s.
> >>> If the disks are all going to be HDDs you're OK, but keep that bit
> >>> in mind.
> >> OK, I was thinking about connecting 24 SSDs via SATA3 (6Gbps).
> >> This is why I chose an 8-port SAS3 LSI card that uses a PCIe 3.0 x8
> >> connection and even supports 12Gbps.
> >> This should allow me to use the full speed of the SSDs (I guess).
> >>
> > Given the SSD speeds you cite below, SAS2 aka SATA3 would do, too.
> > And of course be cheaper.
> >
> > Also what SSDs are you planning to deploy?
> I would go with a bulk of cheap consumer SSDs.
> I just need them to perform better than HDDs, and that's all.
> Anything better is just fine.

Bad idea.
Read the current "SSD MTBF" thread.
If your cluster is even remotely busy, "cheap" consumer SSDs will cost you
more than top-end enterprise ones in a short time (TBW/$).
And they are so unpredictable and likely to fail that a replication of 2
is going to be a very risky proposition, increasing your cost by a third
anyway if you really care about reliability.
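To put rough numbers on the TBW/$ argument, here is a quick back-of-envelope
sketch; the prices and endurance ratings are made-up illustrative figures,
not from any particular datasheet, so plug in the real numbers for the
drives you're considering:

```python
# Cost per unit of rated write endurance; all figures below are
# illustrative placeholders, check the actual datasheets.
drives = {
    # name: (price_usd, rated_endurance_tbw)
    "consumer 480GB":   (250, 70),    # consumer drives are often rated ~70 TBW
    "enterprise 400GB": (900, 7300),  # write-oriented eMLC can be rated in PBW
}

for name, (price, tbw) in drives.items():
    print(f"{name}: ${price / tbw:.2f} per TB written")
```

With numbers anywhere near these, the "expensive" drive is an order of
magnitude cheaper per TB actually written.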

If you can't afford a cluster made entirely of SSDs, a typical HDDs with
SSDs for journal mix is probably going to be fast enough.

Ceph at this point in time can't utilize the potential of a pure SSD
cluster anyway, see the:
"[Single OSD performance on SSD] Can't go over 3,2K IOPS"
thread.

> >> I made this analysis:
> >> - Total output: 8x 12Gbps = 96Gbps of full speed available on the PCIe 3.0
> > That's the speed/capacity of the controller.
> >
> > I'm talking about the actual backplane, where drives plug in.
> > And that is connected either by one cable (and thus 48Gb/s) or two
> > (and thus the 96Gb/s you're expecting), the documentation is unclear
> > on the homepage and not in the manual of that server. Digging around I
> > found http://www.supermicro.com.tw/manuals/other/BPN-SAS3-216EL.pdf
> > which suggests two ports, so your basic assumptions are correct.
> 
> This is what is written for the backplane: one backplane
> (BPN-SAS3-216EL1) with SAS3 2.5" drive slots and 4x mini-SAS3 HD
> connectors for SAS3 uplink/downlink.
> It supports 4x mini-SAS3 HD connectors.
> This is because some people will buy an AOC LSI card to speed up the
> backplane further.
> I understood that it supports 1 or 2 expander cards, each one with 4x
> mini-SAS3 cables.
> 2 daughter cards to have failover on the backplane (however this
> storage server comes with just 1 port).
> Then it should be 4x 12Gb/s? I'm getting confused.
> 
No, read that PDF closely.
The single expander card of that server backplane has 2 uplink ports. Each
port usually (and in this case pretty much certainly) has 4 lanes at
12Gb/s each. 
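If you want to sanity-check the arithmetic yourself, here it is as a
sketch, assuming both uplink ports of the expander are actually cabled to
the controller:

```python
# Backplane uplink bandwidth per drive, assuming 2 cabled uplink ports,
# 4 lanes per port, 12Gb/s per SAS3 lane, shared across 24 drives.
ports, lanes_per_port, gbps_per_lane, drives = 2, 4, 12, 24

total_gbps = ports * lanes_per_port * gbps_per_lane  # 96 Gb/s aggregate
per_drive_gbps = total_gbps / drives                 # 4 Gb/s per drive
per_drive_mbps = per_drive_gbps * 100                # ~400 MB/s (~10 bits/byte on the wire)

print(total_gbps, per_drive_gbps, per_drive_mbps)    # 96 4.0 400.0
```

With only one cable you halve all of that, which is why it matters how the
backplane is wired.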

> > But verify that with your Supermicro vendor and read up about SAS/SATA
> > expanders.
> >
> > If you want/need full speed, the only option with Supermicro seems to
> > be
> > http://www.supermicro.com.tw/products/chassis/2U/216/SC216BAC-R920LP.cfm
> > at this time for SAS3.
> That backplane (BPN-SAS3-216A) costs $300 while the one in that
> storage server is worth $600 (BPN-SAS3-216EL1).
> I think they are both great; however, I cannot choose the backplane
> for that model.
> 
Build it yourself or have your vendor do a BTO, Build To Order.

> > Of course a direct connect backplane chassis with SAS2/SATA3 will do
> > fine as I wrote above, like this one.
> > http://www.supermicro.com.tw/products/chassis/2U/216/SC216BA-R1K28LP.cfm
> >
> > In either case get the fastest motherboard/CPUs (Ceph will need those
> > for SSDs) and the appropriate controller(s). If you're unwilling to
> > build them yourself, I'm sure some vendor will do BTO. ^^
> 
> I cannot change the motherboard (but it seems really good!).
Why?
Not being able to purchase the optimum solution (especially when it is
CHEAPER!) strikes me as odd...


> About CPUs, I decided to go for a double E5-2620.
> http://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
> Not so fast... I went for quantity instead of quality (12 cores will
> be enough, no?).
> Do you think I need to change it for something better?
If all your OSDs are going to be SSDs, yes.

> RAM is 4x 8GB = 32GB
>
Barely enough. If you go for a mixed HDD/SSD setup, add as much
RAM as you can afford; it will speed things up, reads in particular.

Have a look at: 
https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
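As a rough sanity check, a commonly cited rule of thumb is about 1GB of
RAM per TB of OSD storage as a baseline (plus whatever you can spare for
the page cache); the drive sizes below are just example assumptions:

```python
# Rule-of-thumb RAM sizing for an OSD node: ~1 GB RAM per TB of OSD
# storage as a baseline; extra RAM mostly benefits reads via page cache.
def ram_baseline_gb(osd_count, tb_per_osd, gb_per_tb=1.0):
    return osd_count * tb_per_osd * gb_per_tb

print(ram_baseline_gb(24, 0.5))  # 24x 500GB SSDs -> 12.0 GB baseline
print(ram_baseline_gb(24, 4.0))  # 24x 4TB HDDs   -> 96.0 GB baseline
```

So 32GB is workable for a small all-SSD box but already short for 24 large
HDDs.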

> >> - Then I should have for each disk a maximum speed of
> >> 96Gbps / 24 disks = 4Gbps per disk.
> >> - The disks are SATA3 (6Gbps), so I have a small bottleneck here
> >> that lowers me to 4Gbps.
> >> - However, a common SSD never hits the interface speed; they tend
> >> to top out around 450MB/s.
> >>
> >> Average speed of a SSD (MB/s):
> >>
> >>          Min   Avg   Max
> >> Read     369   485   522
> >> Write    162   428   504
> >> Mixed    223   449   512
> >>
> >>
> >> Then having a bottleneck of 4Gbps (which means 400MB/s) should be
> >> fine (if I'm not wrong).
> >> Is what I thought right?
> >>
> > Also expanders introduce some level of overhead, so you're probably
> > going to wind up with less than 400MB/s per drive.
> Is 400MB/s per drive good?
> I don't think a SAS HDD would even reach that speed.
> 
An HDD won't, but SSDs certainly could, depending on the model. I thought
you were going to deploy all SSDs?

> >> I think that the only bottleneck here is the 4x1Gb ethernet
> >> connection.
> >>
> > With a firebreathing storage server like that, you definitely do NOT
> > want to limit yourself to 1Gb/s links. The latency of these links,
> > never mind bandwidth will render all your investment in the storage
> > nodes rather moot.
> >
> > Even if your clients would not be on something faster, for replication
> > at least use 10Gb/s Ethernet or my favorite (price and performance
> > wise), Infiniband.
> 
> I read something about Infiniband but I really don't know much.
> If you have some useful links I will take a further look.
Search the ML archives for Infiniband for starters, read up on it on
Wikipedia, compare prices.

> After this test session we'll need to go live at full speed; every
> suggestion will be appreciated.
> 
> >
> >>>> OK, both of these solutions should support JBOD.
> >>>> However, I read that only an LSI HBA and/or one flashed to IT
> >>>> mode allows you to:
> >>>>
> >>>>     * "plug & play" a new drive and see it immediately on a Linux
> >>>>       distribution (without rescanning disks)
> >>>>     * see S.M.A.R.T. data (because there is no volume layer between
> >>>>       motherboard and disks)
> >>> smartctl can handle the LSI RAID stuff fine.
> >> Good
> >>
> >>>>     * reduce the disk latency
> >>>>
> >>> Not sure about that; depending on the actual RAID and configuration
> >>> any cache of the RAID subsystem might get used, improving things.
> >>>
> >>> The most important reason to use IT mode for me would be in
> >>> conjunction with SSDs: none of the RAIDs I'm aware of allow
> >>> TRIM/DISCARD to work.
> >> Do you know if I can flash the LSI 3108 to IT mode?
> >>
> > Don't know, given that the 2108 can't, I would expect the answer to be
> > no.
> >
> >>>> Then I should probably avoid the LSI 3108 (which has a RAID
> >>>> config by default) and go for the LSI 3008 (already flashed in IT
> >>>> mode).
> >>>>
> >>> Of the 2 I would pick the IT mode one for a "classic" Ceph
> >>> deployment.
> >> Ok, but why?
> > Because you're using SSDs for starters and thus REALLY want an HBA in
> > IT mode. And because it is cheaper and more straightforward.
> Good point :)
> > Also having to create 24 single disk RAID0 volumes with certain
> > controllers (and the 3108 is among them if it is anything like the
> > 2108) is a pain.
> So it's true... I have to set up every disk as a RAID0 volume because
> otherwise the controller cannot present the disks as JBOD?
> Then I cannot "plug & play"...?
Correct, it's not as straightforward as with IT mode or a "real" JBOD-mode
controller.

> >
> > Other controllers will automatically make their onboard cache
> > available in JBOD mode, like Areca. So you get the best of both
> > worlds, at a price of course.
> Then it's just a problem with LSI, and not with other brands?
It's not a problem with LSI if the card/chip supports IT mode.
Not that many other brands out there to begin with, aside from Areca and
Adaptec (I don't like the latter).

> >>>> Is it so, or am I completely wasting my time on useless specs?
> >>> It might be a good idea to tell us what your actual plans are.
> >>> As in, how many nodes (these are quite dense ones with 24 drives!),
> >>> how much storage in total, what kind of use pattern, clients.
> >> Right now we are just testing and experimenting.
> >> We would start with a non-production environment with 2 nodes, learn
> >> Ceph in depth and then replicate tests & findings on 2 other nodes,
> >> upgrade to 10Gb Ethernet and go live.
> > Given that you're aiming for all SSDs, definitely consider Infiniband
> > for the backend (replication network) at least.
> > It's cheaper/faster and also will have more native support (thus even
> > faster) in upcoming Ceph releases.
> > Failing that, definitely dedicated client and replication networks,
> > each with 2x10Gb/s bonded links to get somewhere close to your storage
> > abilities/bandwidth.
> I have 3 options:
> 
>   * add another 4x1Gb card (cheap, but costs many ports on the switch:
>     2x 4 ports per storage node + 1 for management)
>   * add a 2x10Gb card (expensive but probably necessary)
>   * investigate Infiniband further
> 
Again, what is the point of having a super fast storage node all based on
SSDs when your network is slow (latency, thus cutting into your IOPS) and
can't use even 10% of the bandwidth the storage system could deliver?
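To make that concrete, here is a rough sketch using the ~400MB/s per SSD
figure from your own numbers and the usual ~100MB/s of usable throughput
per 1Gb/s of link speed (both approximations, not measurements):

```python
# Fraction of the node's aggregate storage bandwidth the network can move.
drives, mbps_per_drive = 24, 400
storage_mbps = drives * mbps_per_drive      # 9600 MB/s aggregate

for name, link_gbps in [("4x 1Gb bond", 4), ("2x 10Gb bond", 20)]:
    net_mbps = link_gbps * 100              # ~100 MB/s usable per Gb/s
    print(f"{name}: {net_mbps / storage_mbps:.1%} of storage bandwidth")
```

Even bonded 10Gb/s links only reach about a fifth of what 24 SSDs can
stream, and 4x1Gb sits at around 4%.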

> > Next consider the HA aspects of your cluster. Aside from the obvious
> > like having redundant power feeds and network links/switches, what
> > happens if a storage node fails?
> > If you're starting with 2 nodes, that's risky in and of itself (also
> > deploy at least 3 mons).
> >
> > If you start with 4 nodes, if one goes down the default behavior of
> > Ceph would be to redistribute the data on the 3 remaining nodes to
> > maintain the replication level (a level of 2 is probably acceptable
> > with the right kind of SSDs).
> > Now what this means is a LOT of traffic for the replication, potentially
> > impacting your performance depending on the configuration options and
> > actual hardware used. It also means your "near full" settings should
> > be at 70% or lower, because otherwise a node failure could result in
> > full OSDs and thus a blocked cluster. And of course after the data is
> > rebalanced the lack of one node means that your cluster is about 25%
> > slower than before.
> These settings are fine by me. I don't expect node failures to be the
> norm.
Nobody expects the Spanish inquisition. Or Mr. Murphy. 
Being aware of what happens in case it does happen goes a long way.
And with your 2 node test cluster you can't even test this!

> To me, running 25% slower is nothing compared to not running at all.
If the recovery traffic is too much for your cluster (network, CPUs,
disks), it will be pretty much the same thing.
And if your cluster gets full because it was over 70% capacity when that
node failed, it IS the same thing.
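The arithmetic behind that 70% figure, sketched out for a cluster that
rebalances across the survivors:

```python
# After losing `failed` of `nodes` equal nodes, the survivors must hold
# all the data, so utilization grows by a factor of nodes / (nodes - failed).
def utilization_after_failure(current, nodes, failed=1):
    return current * nodes / (nodes - failed)

print(utilization_after_failure(0.70, 4))  # ~0.93: tight but survivable
print(utilization_after_failure(0.80, 4))  # ~1.07: OSDs full, cluster blocked
```

So on 4 nodes, anything much above 70% usage means a single node failure
fills the cluster.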

> I expect to set up a solution for the broken node within 24h and then
> get back to 100% performance.
> 
> > There are many threads in this ML that touch on this subject, with
> > various suggestions on how to minimize or negate the impact of a node
> > failure.
> My objective is just to negate the impact of a complete stop of the
> web farm: having the best performance while everything is good, and
> having enough time to repair nodes when something goes wrong.
For that you need to have enough nodes that can fail. 

> > The most common and from a pure HA perspective sensible suggestion is
> > to start with enough nodes that a failure won't have too much impact,
> > but that of course is also the most expensive option. ^^
> Yes! But "expensive" is not an option these days :)
> I need to be cost-effective.

You can only be effective once you know all the components (HW and SW) as
well as the environment (client I/O mostly).

Christian
-- 
Christian Balzer        Network/Systems Engineer                
[email protected]           Global OnLine Japan/Fusion Communications
http://www.gol.com/