I'm still a noob too, so don't take anything I say with much weight. I
was hoping that somebody with more experience would reply.
I see a few potential problems.
With that CPU to disk ratio, you're going to need to slow recovery down
a lot to make sure you have enough CPU available after a node reboots.
You may need to tune it down even further in the event that a node
fails. I haven't tested a CPU starvation situation, but I suspect that
bad things would happen. You might get stuck with OSDs not responding
fast enough, so they get marked down, which triggers a recovery, which
uses more CPU, etc. I'm not even sure how you'd get out of that
situation if it started.
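For reference, this is the sort of throttling I mean. These are the usual knobs turned all the way down; the exact values are just a starting point I'd test, not something I've validated on hardware like yours:

    # ceph.conf, [osd] section
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

    # or on a running cluster (from memory, double-check the syntax):
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'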
Regarding I/O, the fact that your writes are sequential won't matter. With
the journals on the HDDs, every write to an OSD is really two writes
(journal plus data), so all I/O on the disks becomes random I/O. You have
a lot of spindles, though. Doing some quick estimates in my head, I figure
you realistically have about 200 MBps of I/O per node. That sounds pretty
low compared to the combined sequential write speed of 3.6 GBps, but
figure roughly 10 MBps of random I/O per disk, divided by the 2 writes,
which leaves about 5 MBps of client data per disk. Add to that the latency
of sending the data over the network to 1 or 2 other disks that have the
same constraints.
With replication = 2, that's 100 MBps per node. That ends up being
(best case) about 800 Mbps of RadosGW writes. Hotspotting and uneven
distribution on the nodes will lower that number. If 1 Gbps writes per
node are a hard requirement, I think you're going to be disappointed.
If your application requirements are lower, then you should be ok.
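Spelling out the back-of-envelope math, assuming roughly 10 MBps of random I/O per 4 TB spindle (my guess, measure yours):

    36 disks * 10 MBps random I/O       = ~360 MBps raw per node
    / 2 (journal write + data write)    = ~180 MBps, call it 200 MBps
    / 2 (replication = 2)               = ~100 MBps of client writes per node
    100 MBps * 8 bits                   = ~800 Mbps of RadosGW writes per node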
Regarding latency, it's hard to get specific. Just remember that your
data is being striped across many disks, so the latency of one RadosGW
operation is somewhere between the maximum latency of the OSDs involved
and the sum of their latencies. Like I said, hard to be specific. To
begin with, latency will just increase as the load increases. But at a
certain point, problems will start: OSDs will block because another OSD
won't write its data, your RadosGW load balancer might mark RadosGW nodes
down because they're unresponsive, and OSDs might kick other OSDs out
because they're too slow. Most of my Ceph headaches involve too much
latency.
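You can at least watch for the early signs of that while you test. Something like (from memory, check the exact commands and output on your version):

    ceph health detail     # look for slow/blocked request warnings
    ceph -w                # watch for OSDs flapping up/down during recovery
    ceph osd perf          # per-OSD commit/apply latency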
Overall, I think you'll be ok, unless you absolutely have to have that 1
Gbps write speed per node. Even so, you'll need to prove it. You
really want to load up the cluster with a real amount of data, then
simulate a failure and recovery under normal load. Shut a node down for
a day, then bring it back up. Stuff like that. A real production
failure will stress things differently than `ceph bench` does. I made
the mistake of testing without enough data. Things worked great when
the cluster was 5% used, but had problems when the cluster was 60% used.
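As a rough sketch of the kind of test I mean (the pool name is a placeholder, adjust to your setup):

    # fill the cluster to a realistic level first, e.g. with rados bench
    rados bench -p <testpool> 600 write --no-cleanup

    # for a planned short outage you'd normally 'ceph osd set noout' first,
    # but to test a real failure, leave it unset, stop the ceph-osd daemons
    # on one host, and let the cluster mark them out and start backfilling.
    # Watch client latency while it recovers, and again when the node returns.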
On 5/9/14 04:18, Cédric Lemarchand wrote:
Another thought: I would hope that with EC, the spreading of data chunks
would take advantage of the write capability of each drive where they are stored.
I have not gotten any reply so far! Does this kind of configuration (hardware
& software) look crazy?! Am I missing something?
Looking forward to your comments, thanks in advance.
--
Cédric Lemarchand
On 7 May 2014 at 22:10, Cedric Lemarchand <[email protected]> wrote:
Some more details: the I/O pattern will be around 90% write / 10% read,
mainly sequential.
Recent posts show that the max_backfills, recovery_max_active and
recovery_op_priority settings will be helpful in case of
backfilling/rebalancing.
Any thoughts on such a hardware setup?
On 07/05/2014 11:43, Cedric Lemarchand wrote:
Hello,
This build is only intended for archiving purposes; what matters here
is lowering the $/TB/W ratio.
Access to the storage would be via radosgw, installed on each node.
I need each node to sustain an average write rate of 1 Gbps, which I
think will not be a problem. Erasure coding will be used with something
like k=12, m=3.
A typical node would be:
- Supermicro 36-bay chassis
- 2x Xeon E5-2630Lv2
- 96 GB RAM (the recommended 1 GB/TB ratio for OSDs is lowered a bit ... )
- LSI HBA adapters in JBOD mode, could be 2x 9207-8i
- 36x 4 TB HDDs with the default journal config
- dedicated bonded 2 Gbps links for the public/private networks
(backfilling will take ages if a full node is lost ...)
I think that in an *optimal* state (Ceph healthy), it could handle the
job. Waiting for your comments.
What bothers me more are OSD maintenance operations like backfilling
and cluster rebalancing, where nodes will be put under very high I/O,
memory and CPU load for hours/days. Will latency *just* grow, or will
everything fly apart? (OOM killer kicking in, OSDs committing suicide
because of latency, nodes pushed out of the cluster, etc. ...)
As you can see, I am trying to design the cluster with a sweet spot in
mind like "things become slow, latency grows, but the nodes stay
stable/usable and aren't pushed out of the cluster".
This is my first jump into Ceph, so any input will be greatly
appreciated ;-)
Cheers,
--
Cédric
--
Cédric
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email [email protected] <mailto:[email protected]>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com