On 02/10/2014 03:18, Christian Balzer wrote:
On Wed, 01 Oct 2014 20:12:03 +0200 Massimiliano Cuttini wrote:
Hello Christian,
On 01/10/2014 19:20, Christian Balzer wrote:
Hello,
On Wed, 01 Oct 2014 18:26:53 +0200 Massimiliano Cuttini wrote:
Dear all,
I need a few tips about the best Ceph setup for the drive controller.
I'm getting confused about IT mode, RAID and JBoD.
I have read many posts saying not to go for RAID but to use a JBoD
configuration instead.
I currently have 2 storage alternatives in mind:
*SuperStorage Server 2027R-E1CR24L*
which uses SAS3 via an LSI 3008 AOC in IT mode/pass-through
http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24L.cfm
and
*SuperStorage Server 2027R-E1CR24N*
which uses SAS3 via an LSI 3108 SAS3 AOC (in RAID mode?)
http://www.supermicro.nl/products/system/2U/2027/SSG-2027R-E1CR24N.cfm
Firstly, both of these use an expander backplane.
So if you're planning on putting SSDs in there (even if just like 6 for
journals) you may be hampered by that.
The Supermicro homepage is vague as usual and the manual doesn't
actually have a section for that backplane. I guess it will be a 4link
connection, so 4x12Gb/s aka 4.8 GB/s.
If the disks are all going to be HDDs you're OK, but keep that bit in mind.
OK, I was thinking about connecting 24 SSDs with SATA3 (6 Gb/s).
That is why I chose an 8-port SAS3 LSI card on a PCIe 3.0 x8 connection,
which even supports 12 Gb/s per port.
This should allow me to use the full speed of the SSDs (I guess).
Given the SSD speeds you cite below, SAS2 aka SATA3 would do, too.
And of course be cheaper.
Also what SSDs are you planning to deploy?
I would go with a bulk of cheap consumer SSDs.
I just need them to perform better than HDDs, and that's all.
Anything better than that is fine.
I made this analysis:
- Total throughput: 8x 12 Gb/s = 96 Gb/s full speed available on the PCIe 3.0 link
That's the speed/capacity of the controller.
I'm talking about the actual backplane, where drives plug in.
And that is connected either by one cable (and thus 48 Gb/s) or two (and
thus the 96 Gb/s you're expecting); the documentation is unclear on the
homepage and not in the manual of that server. Digging around I found
http://www.supermicro.com.tw/manuals/other/BPN-SAS3-216EL.pdf
which suggests two ports, so your basic assumptions are correct.
This is what is written for the backplane: one backplane (BPN-SAS3-216EL1)
with SAS3 2.5" drive slots and 4x mini-SAS3 HD connectors for SAS3
uplink/downlink.
It supports 4x mini-SAS3 HD connectors; this is because some people will
buy an AOC LSI card to speed up the backplane further.
As I understood it, it supports 1 or 2 expander daughter cards, each with
4x mini-SAS3 connectors.
Two daughter cards give you failover on the backplane (however, this
chassis comes with just one).
So it should be 4x 12 Gb/s? I'm getting confused.
But verify that with your Supermicro vendor and read up about SAS/SATA
expanders.
If you want/need full speed, the only option with Supermicro seems to be
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BAC-R920LP.cfm
at this time for SAS3.
That backplane (BPN-SAS3-216A) costs $300 while the one in that chassis
(BPN-SAS3-216EL1) costs $600.
I think they are both great; however, I cannot choose the backplane
for that model.
Of course a direct connect backplane chassis with SAS2/SATA3 will do fine
as I wrote above, like this one.
http://www.supermicro.com.tw/products/chassis/2U/216/SC216BA-R1K28LP.cfm
In either case get the fastest motherboard/CPUs (Ceph will need those for
SSDs) and the appropriate controller(s). If you're unwilling to build them
yourself, I'm sure some vendor will do BTO. ^^
I cannot change the motherboard (but it seems really good!).
For the CPUs I decided to go with dual E5-2620s.
http://ark.intel.com/products/64594/Intel-Xeon-Processor-E5-2620-15M-Cache-2_00-GHz-7_20-GTs-Intel-QPI
Not so fast... I went for quantity instead of quality (12 cores will be
enough, no?).
Do you think I need to change it for something better?
RAM is 4x 8 GB = 32 GB.
- Then I should have, for each disk, a maximum speed of 96 Gb/s / 24
disks = 4 Gb/s per disk.
- The disks are SATA3 (6 Gb/s), so I have a small bottleneck here that
limits me to 4 Gb/s.
- However, a common SSD never hits the interface speed; they tend to stay
around 450 MB/s.
Average speed of an SSD (MB/s):
        Min   Avg   Max
Read    369   485   522
Write   162   428   504
Mixed   223   449   512
So having a bottleneck of 4 Gb/s (which means 400 MB/s) should be fine
(if I'm not mistaken).
Is my reasoning right?
Also expanders introduce some level of overhead, so you're probably going
to wind up with less than 400MB/s per drive.
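As a sanity check, the per-drive arithmetic above can be sketched in a few
lines (the 8b/10b line-encoding assumption is what turns 4 Gb/s into
400 MB/s):

```python
# Back-of-the-envelope bandwidth for the 24-bay backplane, assuming
# 8 SAS3 lanes to the controller and 8b/10b line encoding (10 line
# bits per payload byte, as used by SATA3/SAS3).
LANES = 8
GBPS_PER_LANE = 12       # SAS3 line rate per lane
DRIVES = 24
LINE_BITS_PER_BYTE = 10  # 8b/10b encoding overhead

total_gbps = LANES * GBPS_PER_LANE                         # 96 Gb/s aggregate
mbps_per_drive = total_gbps * 1000 // LINE_BITS_PER_BYTE // DRIVES

print(total_gbps, "Gb/s total,", mbps_per_drive, "MB/s per drive")
# 96 Gb/s total, 400 MB/s per drive (before expander overhead)
```

Note this is the best case with two uplink cables; with a single cable the
aggregate halves to 48 Gb/s and each drive gets roughly 200 MB/s.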
Is 400 MB/s per drive good?
I don't think a SAS HDD would even reach that speed.
I think the only bottleneck here is the 4x 1Gb Ethernet connection.
With a firebreathing storage server like that, you definitely do NOT want
to limit yourself to 1Gb/s links. The latency of these links, never mind
bandwidth will render all your investment in the storage nodes rather moot.
Even if your clients would not be on something faster, for replication at
least use 10Gb/s Ethernet or my favorite (price and performance wise),
Infiniband.
I have read a little about InfiniBand but I really don't know much.
If you have a useful link I will take a further look.
After this test session we'll need to go live at full speed; every
suggestion will be appreciated.
OK, both solutions should support JBoD.
However, I read that only an LSI HBA and/or a card flashed to IT mode
allows you to:
 * "plug & play" a new drive and see it immediately on a Linux
 distribution (without rescanning disks)
 * see S.M.A.R.T. data (because there is no volume layer between the
 motherboard and the disks)
smartctl can handle the LSI RAID stuff fine.
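For reference, a quick sketch of both cases (device names and the
megaraid index are examples; the right N comes from how the controller
enumerates its physical disks):

```shell
# SMART data for a physical disk hidden behind an LSI MegaRAID
# controller: address it as megaraid,N through the logical device.
smartctl -a -d megaraid,0 /dev/sda

# With an HBA / IT-mode controller the disks appear directly:
smartctl -a /dev/sdb
```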
Good
* reduce the disk latency
Not sure about that; depending on the actual RAID and configuration, any
cache of the RAID subsystem might get used, so improving things.
The most important reason to use IT mode, for me, would be in conjunction
with SSDs: none of the RAID controllers I'm aware of allow TRIM/DISCARD
to work.
Do you know if I can flash the LSI 3108 to IT mode?
Don't know, given that the 2108 can't, I would expect the answer to be no.
Then I should probably avoid the LSI 3108 (which has a RAID configuration
by default) and go for the LSI 3008 (already flashed to IT mode).
Of the 2 I would pick the IT mode one for a "classic" Ceph deployment.
Ok, but why?
Because you're using SSDs for starters and thus REALLY want a HBA, IT mode.
And because it is cheaper, more straightforward.
Good point :)
Also having to create 24 single disk RAID0 volumes with certain controllers
(and the 3108 is among them if it is anything like the 2108) is a pain.
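For what it's worth, MegaCli (LSI's CLI for these controllers) can at
least create the 24 single-disk RAID0 volumes in one shot; exact syntax
varies by MegaCli version, so treat this as a sketch:

```shell
# Create one RAID0 logical drive per unconfigured physical disk on
# adapter 0 (WB = write-back cache, RA = read-ahead, Direct I/O).
MegaCli64 -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0

# List the resulting logical drives to verify:
MegaCli64 -LDInfo -Lall -a0
```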
So it's true... I have to format every disk as a single-drive RAID0
volume because the controller cannot present the disks as JBoD?
Then I cannot "plug & play"?
Other controllers will automatically make their onboard cache available in
JBOD mode, like Areca. So you get the best of both worlds, at a price of
course.
Then is it just a problem with LSI, and not with other brands?
Is that so, or am I completely wasting my time on useless specs?
It might be a good idea to tell us what your actual plans are.
As in, how many nodes (these are quite dense ones with 24 drives!), how
much storage in total, what kind of use pattern, clients.
Right now we are just testing and experimenting.
We would start with a non-production environment of 2 nodes, learn Ceph
in depth, then replicate the tests and findings on 2 more nodes, upgrade
to 10Gb Ethernet, and go live.
Given that you're aiming for all SSDs, definitely consider Infiniband for
the backend (replication network) at least.
It's cheaper/faster and also will have more native support (thus even
faster) in upcoming Ceph releases.
Failing that, definitely dedicated client and replication networks, each
with 2x10Gb/s bonded links to get somewhere close to your storage
abilities/bandwidth.
I have 3 options:
 * add another 4x 1Gb card (cheap, but it costs many ports on the switch:
 2x 4 ports per storage node + 1 for management)
 * add a 2x 10Gb card (expensive but probably necessary)
 * investigate InfiniBand further
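Whichever link technology wins, the dedicated client and replication
networks Christian mentions map onto two ceph.conf options (the subnets
below are placeholders):

```ini
[global]
    ; network clients use to reach the mons/OSDs
    public network = 192.168.1.0/24
    ; dedicated OSD replication/backfill network
    cluster network = 192.168.2.0/24
```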
Next consider the HA aspects of your cluster. Aside from the obvious like
having redundant power feeds and network links/switches, what happens if a
storage node fails?
If you're starting with 2 nodes, that's risky in and by itself (also
deploy at least 3 mons).
If you start with 4 nodes, if one goes down the default behavior of Ceph
would be to redistribute the data on the 3 remaining nodes to maintain the
replication level (a level of 2 is probably acceptable with the right kind
of SSDs).
Now what that means is a LOT of traffic for the replication, potentially
impacting your performance depending on the configuration options and
actual hardware used. It also means your "near full" settings should be at
70% or lower, because otherwise a node failure could result in full OSDs
and thus a blocked cluster. And of course after the data is rebalanced the
lack of one node means that your cluster is about 25% slower than before.
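The 70% figure follows directly from the arithmetic: with 4 equal nodes,
losing one spreads its data over the remaining 3, multiplying their usage
by 4/3, so OSDs at 70% land at roughly 93%, just under the default full
cutoff. As a ceph.conf sketch (the ratios are examples):

```ini
[global]
    ; warn early enough that losing 1 of 4 nodes (usage x 4/3)
    ; cannot push OSDs past the full ratio
    mon osd nearfull ratio = .70
    ; OSDs block writes at this ratio (the default)
    mon osd full ratio = .95
```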
These settings are fine by me. I don't expect node failure to be the
norm.
To me, running 25% slower is nothing compared to not running at all.
I expect to have a fix in place for the broken node within 24h and then
get back to 100% performance.
There are many threads in this ML that touch on this subject, with
various suggestions on how to minimize or negate the impact of a node
failure.
My objective is just to negate the impact of a complete stop of the web
farm: the best performance while everything is healthy, and enough time
to repair nodes when something goes wrong.
The most common and from a pure HA perspective sensible suggestion is to
start with enough nodes that a failure won't have too much impact, but
that of course is also the most expensive option. ^^
Yes! But "expensive" is not an option in this epoch :)
I need to be cost-effective.
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com